Generative AI Application Retrofit – part 2 - Behind the Scenes (S1:E2)

Link to movie in LinkedIn

In this post, I'll take you behind the scenes of my prior post. I'll delve into what it takes to implement generative AI for the two features we previously explored. In my view, discussing the envisioned requirements isn't sufficient; a deep understanding of the technology is essential to leverage its full potential. While anyone can read the API specifications, applying that API in a real-world scenario invariably reveals additional insights that aren't apparent from the specifications alone.

Summarize Source for a Synopsis

Here is the screen design for the UI, with the button to perform the Summarize Source selected.

Layout and Button for Summarize Source

In this instance, I updated my “SummarizeSource” function to include a call to the ChatGPT API. This enhancement involved adding just a single line to that function, as highlighted below:

Function call added to original “SummarizeSource” function

The two critical lines in the code are highlighted below in grey and yellow. As you can see, the actual ChatGPT prompt, defined in the $userPrompt variable, is quite straightforward. Currently, it's hardcoded in English, so it's not yet ready for production. The Insert from URL call executes the actual request to the ChatGPT API using cURL. Additionally, the $formattedText variable captures the text from the API response, which is invaluable for refining the prompt and providing feedback to the user.

Function “Send Synthetic Summarize Story Source”

Also, it's important to note that the purpose of the synopsis is somewhat flexible. It's designed to remind the user of the key points covered in the full article and can be further refined by the author for enhanced accuracy.

Natural Language Processing (NLP) Navigation

My second retrofit feature introduces a Command Line Interface (CLI) accessible throughout the application, enabling users to easily navigate to the appropriate screen and record. This feature is far more ambitious and complex. It required a user interface that had to be compact, prominently placed for easy access, and responsive, automatically hiding when not in use. My goal was to create a minimalistic design, reminiscent of the familiar ChatGPT interface. Additionally, I focused on creating a seamless flow, providing user feedback as generated by ChatGPT to clarify how ChatGPT interpreted the user’s input.

Layout with Popup button calling function “Send Synthetic CLI Selector Prompt to Categorize Initial Input”

Navigate to the Screen

Retrofitting ChatGPT into an existing application presents a unique challenge: ChatGPT has a general understanding of language but lacks specific knowledge about the application. To bridge this gap, I provided contextual information about the application in addition to the user input. The fundamental approach involved describing each screen in the application in sufficient detail so that ChatGPT can accurately map user intentions to the corresponding application screen.

Initially, I utilized the existing help text for each screen to provide this context. While this approach was somewhat effective, it wasn't efficient and wasted tokens. Consequently, I revised the descriptions for each screen to ensure they were distinct and accurately reflected the actions a user would perform on them.

Ultimately, the process involved calling a function to construct the prompt. In this function, I first identified myself as a writer, then described the application screens, and finally instructed ChatGPT on how to interpret these screens. This enabled a precise mapping of the user input to the appropriate screen in the application.

Building the Prompt for the NLP Navigation in “Generate NLP Nav Prompt” function

Currently, this prompt is hardcoded into the application, but the plan is to transition to a table-driven approach. The segment of the prompt that deals with interpreting this context is as follows:

$NavContext & Char(13) & "I want determine the location to navigate to in the software application. Here is the phrase to evaluate against the Locations and their Descriptions:  " & Char(13) & "Give me the LocationID that matches the input."

The entirety of the process described above serves as context. I then appended the user input, as entered in the UI, to the end of this context and sent the complete string in the API call. This is why I referred to the process as 'synthetic' – I needed to construct this comprehensive context for the app in addition to incorporating the user input.

The key here is obtaining the LayoutID, which the app uses to navigate to the appropriate layout or screen. For this purpose, I've employed the ChatGPT 3.5 turbo model, as it offers both speed and adequacy for the task, guiding me to the correct layout.

Without ChatGPT, coding for this type of functionality would likely be very fragile. Predicting user input and identifying the relevant keywords or phrases to interpret and mapping them to a specific screen is challenging. This is where ChatGPT shines. However, it also underscores the importance of thorough testing to determine the most suitable model to apply. Often, you'll find that the initial output doesn't precisely match your requirements, necessitating revisions to the prompt for greater specificity. Being verbose and phrasing things in a way that might seem cumbersome to a human can actually be beneficial for ChatGPT to interpret. Detailed instructions to the Language Learning Model enhances the likelihood of achieving the desired output.

Navigate to a Record

Having navigated to the correct screen, the next step is to locate the specific record. Ideally, one could describe every record in the database to the Language Learning Model (LLM) and simply receive the record ID to access. However, this approach is impractical due to the enormous token consumption and potentially slow performance. Therefore, a second API call is necessary to reinterpret the user input in the context of the mapped layout or screen. Consider the following user input as an example:

I want to edit my story “When Worlds Meet”.

In this case, the initial prompt determined that the Edit mode of the Story screen was the appropriate response to the input. Interestingly, the story's title wasn't necessary for determining the correct screen. However, with the context now established, the phrase "When Worlds Meet" is key to identifying the necessary record. By leveraging ChatGPT to extract the title "When Worlds Meet," the application can then perform a find operation to navigate to that specific record.

This method is effective in scenarios where user input is likely to be clear and interpretable. In more complex real-world applications, it might be beneficial to construct a vector database representation of the data, which could then be used to locate the correct record.

The challenge in interpreting records lies in identifying the additional information provided by the user. Does it contain a proper name for a story, publisher, submission, or bookmark? Or perhaps a personal name for a contact, or a date for a calendar record? Users may or may not enclose names in quotes, and there's no control over the format of user entries. Understanding these nuances is crucial for accurate record identification and navigation.

Function “Send Synthetic Nav Prompt”

Here is the prompt I used for extracting the proper name from the user input in the function called on line 62 above:

Does the following input contain a reference to a proper name? If the input contains an item in quotes, that should also be considered a Proper Name. If it does, then answer with that proper name only, and if it does not, then answer with 'None'. Here is the input:

I discovered that the ChatGPT 3.5 turbo model wasn't accurately interpreting which portion of the user input constituted a proper or personal name. To address this, I switched to the ChatGPT-4 model, which significantly improved the results. Consequently, I updated my Bring Your Own License (BYOL) setup to allow the specification of the model to be used for high-fidelity calls. While these calls are more expensive, the tradeoff is justified in this context. The prompts are much shorter and consume fewer tokens, making the additional cost worthwhile for the enhanced accuracy.

Be sure to FOLLOW me for my next update where I generate a summary of what’s important to the user when they first log in (S1:E3).