Generative AI Application Retrofit – Extracting Structured Information

Link to the video on LinkedIn.

In today's update, I'm pleased to share my latest experience using generative AI to extract structured information.

One of the strengths of generative AI is a rudimentary ability to read and extract meaning. I previously leveraged this to generate a summary of a story. Today’s example takes this much further. I provided input on publishers and asked the LLM to create a JSON structure whose elements corresponded to fields in my database. In effect, I wrote prompts that acted as fuzzy rules for extracting the data, categorizing it into pre-defined values, and formatting it so it could readily be used.
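As a rough illustration of this idea, here is a minimal sketch of a prompt builder that asks for a JSON object keyed to database fields. The field names and descriptions are hypothetical stand-ins; the article does not publish the actual schema or prompt.

```python
# Hypothetical field list mirroring database columns; these names are
# illustrative, not the actual schema used in the article.
FIELDS = {
    "name": "the publisher's name, exactly as shown on the site",
    "open_for_submissions": "YES, NO, or INCONCLUSIVE",
    "response_days": "typical response time, converted to a number of days",
}

def build_extract_prompt(page_text: str) -> str:
    """Assemble a prompt that forces the reply into a single JSON object."""
    field_lines = "\n".join(f'- "{k}": {v}' for k, v in FIELDS.items())
    return (
        "Read the publisher website text below and reply with ONLY a JSON "
        "object containing these keys:\n"
        f"{field_lines}\n\n"
        f"WEBSITE TEXT:\n{page_text}"
    )
```

Constraining each key to a description (and, where possible, an enumerated set of values) is what makes the downstream parsing reliable.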

Leveraging AI to Identify and Extract Data

My initial effort at extracting data was to look at a publisher’s website and determine whether they were currently open for submissions. Simple enough, but vexing since this information is not always on their home page. Not all publishers actually post this information, and I would say that across the spectrum, they are devilishly clever in finding different ways to phrase or otherwise indicate whether they are open or closed for submissions. Additionally, submissions may be ingested in a variety of ways, and with varying submission systems. Some of a publisher’s publications may be open for submission, while others are closed. This makes for a complex environment to interpret.

Extract Popover on Publisher’s Edit Screen

Given enough information coupled with a detailed prompt, I achieved a very high rate of success in evaluating open-submission status. I then thought, why stop there? So I moved on to having the LLM return a JSON structure containing more information. I also fed the LLM an extensive prompt with both detailed instructions and a large amount of data. I found this more efficient and cost effective, since I did not know where on their websites publishers would place this information, and it was often scattered across multiple pages.

Crafting the Prompt

My prompt evolved as I ran it against more websites. I generally forced the reply into a fixed set of responses, such as YES, NO, or INCONCLUSIVE. For numbers, I would ask for the number of days for a response; if the website listed this in weeks, or spelled out the days, weeks, or months in words, I would have the LLM transform the answer into a number of days. In the end, the prompt was quite involved, nearly 1,800 words long, but I cannot imagine achieving this level of accuracy with conventional code.

Publisher Extract Prompt

The approach I took was to add two fields to my Publisher record: one for the JSON output and one for the timestamp of when the JSON was updated. I then displayed the extracted data and highlighted where the information differed from the existing record. It normally took between 11 and 20 seconds to respond when I sent the LLM between 4K and 10K tokens, and I received about 100 to 200 tokens back. A few outliers took upwards of 40 seconds to respond, but these were rare.
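The highlight-the-differences step reduces to a field-by-field comparison between the stored record and the parsed JSON. A minimal sketch (the field names here are arbitrary examples, not the actual schema):

```python
def diff_fields(record: dict, extracted: dict) -> dict:
    """Return {field: (old, new)} for every field where the extract
    disagrees with the stored record, so the UI can highlight it."""
    return {
        key: (record.get(key), value)
        for key, value in extracted.items()
        if record.get(key) != value
    }

# Example: only the changed field is flagged for review.
changes = diff_fields(
    {"name": "Acme", "open_for_submissions": "YES"},
    {"name": "Acme Press", "open_for_submissions": "YES"},
)
```

Keeping the raw JSON and its timestamp on the record means the comparison can be re-rendered at any time without re-querying the LLM.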

Value of the Response

I found the LLM could read the input and make sense of it many times faster than a human. It became addictive to run the queries to see whether it could figure out what was needed. Giving it examples as part of the prompt was very effective. I thought my original analysis of the publisher websites was accurate; I was mistaken to some extent, as the AI-generated results were more thorough. Using this iterative approach of constantly comparing the output, I essentially answered the question: can I trust the results? At some level, depending on the problem and the nature of the input you can provide, yes, you can!

I found that even a human could never return 100% accurate results, and the information on the websites changed frequently enough that what I had garnered previously was often stale. The publisher websites also varied widely in their content, structure, and terminology. Having the ability to automate this analysis resulted in more accurate and timely information to work with. I also did not update the data directly, but displayed the intermediate results so the user could apply them to the record and do a final check. In the end, it greatly improved the information I was working with.

All told, I spent a little over $50 to update over 800 records. If I did this manually, the cost of my time alone would far exceed this figure, and it would take much longer. My use of tokens was very asymmetric, typically about 40:1 input to output, which I believe is very good for productivity. Basically, no human would want to take the time to be as accurate as the LLM was at this task.
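The economics are easy to sanity-check. Below is a small sketch of the arithmetic; the per-token rates are purely illustrative assumptions (not actual provider pricing), chosen so that the figures from the post (~7K input and ~175 output tokens per record, a roughly 40:1 ratio, 800 records) land near the reported ~$50.

```python
def refresh_cost(n_records: int, in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Estimate total cost in dollars; rates are $ per 1K tokens."""
    per_record = in_tokens / 1000 * in_rate + out_tokens / 1000 * out_rate
    return round(n_records * per_record, 2)

# Illustrative rates only -- not real pricing for any specific model.
estimate = refresh_cost(800, 7000, 175, in_rate=0.008, out_rate=0.024)
```

Note that with a 40:1 input/output ratio, the input side dominates the bill, which is why cramming multiple pages of site text into one call still beats many small calls.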

Quantitative Results

How accurate or effective was this approach? Let’s take this apart into the separate elements:

Name: This is a good check because one company often buys another, and it is easy to introduce typos into the name, especially with odd spacing and capitalization. A win.

Description: I kept it to 75 words, and these were nearly always good, though a bit redundant with the other information on the page. I set the temperature to zero, but did not control the seed, so every time I refreshed the extract, the description would change slightly. This is a minor annoyance; when the change is not material, I simply accept it.

For the next set, I ran some metrics on the counts to determine how effective this approach was. This dialog is shown upon pressing the “Summary Counts” button at the bottom right corner of the Extract popover (seen in a figure above), and here are the Summary results:

Summary Counts

I did some pre-processing in my prompt. Extracting whether a publisher is a news organization was useful: news organizations are effectively always open to letters to the editor and op-eds, even though they never explicitly say they are “open”.

For the others, I swapped back and forth between the website and the record to compare the actual results. These were the final results after I finished refining the prompt. Since this is across about 800 sample records, it is fairly repeatable. My extract popover shows items in RED if they differ between the extract and the specific Publisher record. In the end, my goal was to eliminate all of the red, though some could not be eliminated due to oddities in how a few publishers list their information.

Open Status: The accuracy for determining open/closed status was extremely high, at 90%+. I also tried to extract information on publisher reading periods, but the prompt failed to pick up these date ranges effectively (Inconclusive), mostly because some 65 publisher sites had broken pages, redirects, or other issues that made them impossible to process. I recorded this as the efficacy of evaluating for this information.

Multiple Allowed: The wording for whether a publisher allows multiple submissions varied widely and was often buried deep in a long set of instructions. Many publishers do not mention this parameter at all, so while an efficacy of 48% appears low, it reflects a lack of data rather than a failure of extraction. A win.

The remaining metrics were similar to the above, and as I said above, I created a set of fuzzy rules to guide the LLM to extract the data that I needed. It will be interesting to see how these results change over time as the LLMs continue to improve.

Behind the Scenes

For running the prompt, I wrote a ‘Send Synthetic Publisher ALL Refresh Prompt’ function and called it for each record. Each call took about 15 to 20 seconds with 5 to 10 thousand tokens, so a full refresh of 800 records took hours and, using ChatGPT-4, incurred some noticeable costs. Still, I believe the results were quantitatively and qualitatively better than a manual approach. This is especially true if you need to refresh this information on a regular basis.
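The per-record refresh loop can be sketched as follows. The function and field names here are my own illustrative stand-ins (the article's function is named 'Send Synthetic Publisher ALL Refresh Prompt'), and the LLM call is injected as a parameter to keep the sketch provider-agnostic.

```python
import json
import time

def send_refresh_prompt(record: dict, call_llm) -> dict:
    """Refresh one publisher record; `call_llm` takes a prompt string and
    returns the model's JSON reply as text. Field names are hypothetical."""
    prompt = f"Extract publisher fields as JSON for: {record['url']}"
    reply = call_llm(prompt)               # ~15-20 s per call in practice
    record["extract_json"] = json.loads(reply)
    record["extract_updated"] = time.time()
    return record

def refresh_all(records, call_llm, pause_s: float = 0.0):
    """Serial full refresh; at 15-20 s per record, 800 records take hours."""
    for rec in records:
        send_refresh_prompt(rec, call_llm)
        time.sleep(pause_s)                # optional pacing for rate limits
    return records
```

Running the calls serially is the simple option; batching or parallelizing would shorten the wall-clock hours, at the cost of bumping into rate limits sooner.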

For my next post, I'll delve into the use of generative AI to create a Chatbot that works with my online help and tutorials. For more insights and updates, don't forget to follow me!


ChatGPT: "Bye Bye programmers! (not today, but ... soon)"