Generative AI Application Retrofit (S1:E1)

Movie for S1:E1 in LinkedIn

Welcome to my series of posts chronicling my journey with the ChatGPT API. As a Product Manager, I'm excited to share real-world applications of Generative AI. To that end, I will be using an application I wrote in FileMaker, which employs a scripting language and packages my API calls as JSON. At the end of this post is a movie I made on this topic, showing everything described here in action. So, without further ado, I'll get started.

I find ChatGPT to be a bit like magic, and also a bit sloppy in its API responses, despite my efforts to force it into a more controlled output format. My application, The Writer’s Scribe (TWS), takes stories or articles and creates submissions to one or more publishers, then tracks the status of each submission. While the premise is simple, complexity arises in managing rules around multiple and simultaneous submissions, publisher availability, genre preferences, and submission timelines.

To retrofit generative AI into my TWS application, I first looked at the types of problems that would benefit the most from ChatGPT and that were easy to implement. I already had a function to summarize the source text for an article, and a synopsis field to hold the result. Many publishers require a synopsis in the cover letter for a submission, so TWS has a template that generates a query letter that includes the synopsis. My goal was to have ChatGPT create a summary that could be used as the synopsis.

The ChatGPT API call works well when I limit the summary to 100 words and use it on articles under roughly 2K words. Since many of my pieces are short stories, this fits my needs. Below is a screenshot where I initiate the summary, and where the result is displayed in the synopsis field.

Analyze Source for Synopsis

In this case, I did not make it apparent that the synopsis was generated via AI, since it is simply a synopsis and I expect the user will review it as needed. If the call fails, the synopsis is simply left empty. I set the ChatGPT temperature to zero, so hallucinations are not a concern since this reduces the level of creative variation. My takeaway for this feature is that it is important to select a model whose context window can handle the input; in this case, I used “gpt-3.5-turbo-16k”. I used a paid account for the API calls, and over a period of several days, the cost was under $2 for my testing and the generation of several summaries. When creating summaries, my input will be much larger than my output, so in the future I may adjust the call, since my input/output is asymmetric and I want to minimize my token use.
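For readers who have not packaged such a call themselves, here is a minimal Python sketch of the kind of JSON request body involved. The model name and temperature come from this post; the prompt wording, function name, and sample text are my illustrative assumptions, not the actual TWS prompt (TWS builds its JSON in FileMaker scripting).

```python
import json

def build_summary_request(source_text, word_limit=100):
    """Build a Chat Completions request body asking for a short synopsis.

    The prompt wording below is an illustrative assumption; only the
    model and temperature values are taken from the post.
    """
    return {
        "model": "gpt-3.5-turbo-16k",  # 16k context handles articles under ~2K words easily
        "temperature": 0,              # deterministic: same prompt, same output
        "messages": [
            {"role": "system",
             "content": f"Summarize the user's text in at most {word_limit} words."},
            {"role": "user", "content": source_text},
        ],
    }

# The dict would be serialized and POSTed to the chat completions endpoint.
body = json.dumps(build_summary_request("Once upon a time ..."))
```

Keeping the word limit as a parameter makes it easy to tune the synopsis length per publisher without touching the rest of the call.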

ChatGPT Bring Your Own License Setup

My second retrofit feature was more ambitious. I wanted to add a Command Line Interface (CLI) to the product that would be available throughout the application. This CLI would allow the user to navigate to the appropriate screen, and where possible, to the specific record. It is purely navigational in nature, so it is relatively safe to execute, and if it fails, it tells the user why it failed using the API response. In the future, I would like to expand the CLI to perform queries, commands, and other higher-level processes.
Shown below is one of the screens where I added a button at the top to show a popover for the CLI. The button is placed in the same position on every screen, with the same popover, for consistency. To use this interface, the user clicks the button, enters free-form text, and then presses “Enter” or clicks the Send button.

AI Popover to Chat for Navigation

This is where the magic of generative AI comes in. To perform even simple navigation, I need to provide the API with context to interpret the input against. Essentially, I need to describe each screen in the application, what it does, and what kinds of objects it operates on. I had already implemented a simple Help popover on each screen describing what that screen does, so I used that text to create a prompt associating each screen with a description. I then asked ChatGPT to determine the best fit between the user’s input and the available destinations. In this case, the user input was: “Edit my story ‘When Worlds Meet’.”

Story Record with AI Popover with User Feedback

I found that the full context used for the prompt would be far too much information to expose to the user. So I created what I call a synthetic prompt: I took the context generated from the screen descriptions and screen IDs, then appended the user input to the end of it. In general, this worked well.
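The assembly of such a synthetic prompt can be sketched as follows. The screen IDs, descriptions, and instruction wording here are hypothetical stand-ins; TWS draws its descriptions from the Help popovers, and the real prompt text is not shown in the post.

```python
def build_synthetic_prompt(screens, user_input):
    """Assemble a "synthetic prompt": screen context first, then the
    user's free-form input appended at the end.
    """
    lines = [
        "You are a navigation assistant for The Writer's Scribe.",
        "Pick the single best-fit screen ID for the user's request:",
        "",
    ]
    for screen_id, description in screens.items():
        lines.append(f"{screen_id}: {description}")
    lines.append("")
    lines.append(f"User request: {user_input}")
    return "\n".join(lines)

# Hypothetical screen IDs and Help-popover descriptions.
screens = {
    "SCR_STORIES": "Lists all stories; edit a story's text and metadata.",
    "SCR_PUBLISHERS": "Lists publishers, their genres, and submission rules.",
}
prompt = build_synthetic_prompt(screens, "Edit my story 'When Worlds Meet'.")
```

The user only ever types the last line; everything above it is generated once from the screen descriptions and reused for every request.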

The question remained: does this approach deliver the value my users expect? Since it does not actually add new capabilities and only addresses the user experience, it adds only limited value. Still, users who eschew clicking and want to simply ask the application to do what they need will benefit. This is especially true of new users or those who use the application infrequently. So, for this group, it is a valuable feature.

Next, does this feature work? That turned out to be a more complex problem. I have no control over what the user enters, so I face a classic “garbage in, garbage out” problem. To address this, I added the ability to log every call and let the user give favorable or unfavorable feedback on each one, along with a comment. I anticipate needing to refine the ChatGPT prompts, and this type of input will help because it gives me real-world examples.

I found that simply navigating to the correct tab is not a real win, because after all, the user could just click a tab and reach a given record quickly. So I expanded the navigation to land not only on the correct screen, but also on the correct record. ChatGPT is able to extract from the user input whether it contains a personal name or a proper name (e.g., a person’s name versus the title of a story). Based on the navigation target, the application can then perform a find using that name to move to the matching record or records. After I refined the context, my ChatGPT calls return in about 3 seconds. So this approach is a win: it supports a chat-like interface, and the user does not need to know the structure of the information.
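One way to wire the model's answer into a find is to have the prompt request a small structured reply and parse it. The JSON shape below (screen_id, name_type, name) and the field names are an assumed contract of my own devising; the post only states that ChatGPT can distinguish a person's name from a story title so the application knows which field to search.

```python
import json

def parse_navigation_reply(reply_text):
    """Parse the model's reply into a navigation action.

    Assumes the prompt asked for a JSON object with screen_id,
    name_type ("personal" or "proper"), and the extracted name.
    """
    data = json.loads(reply_text)
    # Pick which field the find should run against.
    field = "author_name" if data.get("name_type") == "personal" else "story_title"
    return data["screen_id"], field, data.get("name")

reply = '{"screen_id": "SCR_STORIES", "name_type": "proper", "name": "When Worlds Meet"}'
screen, field, name = parse_navigation_reply(reply)
```

Asking for JSON rather than free text keeps the parsing deterministic on the application side, so a malformed reply can be caught and reported to the user rather than silently misrouted.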

To conclude today’s post, here is the interface for the log entries I captured for each of these CLI interactions. This gives me good user feedback so I know where to focus my efforts to improve the prompts. It also provides performance information, which is always important for a good UX.

Log of Generative AI User Prompts

Thank you for reading this post. I hope it gives you a better appreciation of the problems you can solve in an application retrofit. See below for a short movie that illustrates the above. Next time, I’ll delve into my work to expand the CLI to cover commands, where I make some much more exciting, and dangerous, enhancements!

Be sure to FOLLOW me for my next update (S1:E2).


@dswatski - great post exploring the possibilities of using an LLM to augment UX. One thought: you mention that setting the ChatGPT temperature to zero means hallucinations are not a concern.

I don’t think that setting the temperature to zero will do anything to prevent hallucination; it will only ensure that you get the exact same response from the model every time you use the exact same prompt.

Thank you for the comment. I did some research, and it turns out that in the current generation of LLMs, there is no way to ensure that the model does not hallucinate. Yes, setting the temperature to zero makes it deterministic. So if I ask, “Is the RGB value 255,0,0 the color red?”, it will come back and say yes, and a lot more. But apparently, if I do that 5,000 times, it may at some point come back and tell me it is the color purple. In fact, one paper I read said that setting the temperature to zero may result in more hallucinations, presumably when the training set contains varying information on the given topic.

At the moment, I am less interested in it being creative, and more in it giving an accurate and reproducible summary or interpretation. When I asked ChatGPT itself whether there is a relationship between the temperature setting and hallucinations, it said yes: there is a higher likelihood of hallucination at a higher temperature. I think it also depends on the prompt and the data, both in the training set and in the data provided as part of the prompt. For example, when I asked for a given value to be extracted, varying the order of the data extraction rules had an impact on the answer it gave.

I am also finding that it makes a difference which model is used. In general, ChatGPT 4 gives me better results than 3.5, especially where it needs to interpret what is in the data provided. I have been impressed with what it can do, but it comes down to whether you can trust it to give you the right answer every time, and if not, how much error you are willing to accept. For some of my problems, I found 3.5 is fast and good enough to reach the accuracy I want, and performance matters. For other prompts that run less frequently, I needed v4 to reach the required accuracy. Also, v4 is much, much more expensive to use than 3.5.

If you have any experience with temperature settings, I would be very interested in hearing about it. I am currently using zero for everything. For writing correspondence, I think I can raise it a bit so it creates more interesting output. When I asked ChatGPT Plus what temperature setting it uses, it gave a long answer, but basically said about “.7”. Again, thanks for the question. It made me think about and evaluate how best to use this.