Engineering with AI 3: A Real Application
Prompt engineering is an exploration of how to "condition" an LLM to optimize its results and behavior. I'm not sure I'm a fan of the term, really, because it's a stretch to call it "engineering". That's not to say it's entirely unscientific...but a lot of prompt engineering seems to be trial and error where theory doesn't always apply universally.
Let's Build a Backend
Based on our previous article about prompt engineering, I know that this is going to be an iterative process. More than that, I know I want it to be a structured process, too, where I can apply prompts to multiple models within OpenAI...or use an entirely different platform. That makes it a bit more complex than just making a simple backend that can call OpenAI and bake some static content with the results.
There are a few reasons I decided to use Laravel for this backend. First, I already know it...and for projects like this, that's a good enough reason. Second, it's very high level and opinionated, which is what I want for a project like this. I don't need all of Symfony's granularity when my focus is slapping down a prototype quickly. Nor do I need a compiled language like Go; as easy as it is, it's still not as "plug and play" as Laravel.
The backend structure is fairly simple, but also involves a bit more than just calling an API and throwing the results in a database. Here's a quick high level overview of the models we'll build:
- SystemRole: This contains our "role" prompt along with all the key tuning params we covered in the linked article above like max tokens, frequency penalty, model, etc.
- DataRun: This represents a complete iteration of all the data we need. It's possible that a given role prompt might have multiple iterations because the data returned can be different, so we need a good way to track each distinct crawl of data from OpenAI. This also gives us persistence in case we run into an error -- we can resume where we left off easily. Each DataRun has one "SystemRole" and many "Histories".
- History: This is the actual data returned from the API. We organize it by month and year. We decompose the JSON and store each event as a separate field. Belongs to a "DataRun".
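The actual backend defines these as Laravel Eloquent models, but the relationships are easy to sketch in any language. Here's a rough Python sketch of the three models and the resume-on-error idea; all field names here are illustrative assumptions, not the project's real schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- the real project uses Laravel Eloquent
# models; these field names are assumptions.

@dataclass
class SystemRole:
    prompt: str                  # the "role" prompt sent as the system message
    model: str = "gpt-3.5-turbo"
    max_tokens: int = 1024
    frequency_penalty: float = 0.0

@dataclass
class History:
    month: int
    year: int
    events: list                 # decomposed JSON events

@dataclass
class DataRun:
    role: SystemRole                               # one SystemRole per DataRun
    histories: list = field(default_factory=list)  # ...and many Histories

    def resume_point(self, start_year: int) -> tuple:
        """Return the next (month, year) to fetch, so a run that hit an
        error can resume where it left off."""
        done = {(h.month, h.year) for h in self.histories}
        year, month = start_year, 1
        while (month, year) in done:
            month += 1
            if month > 12:
                month, year = 1, year + 1
        return month, year
```

The `resume_point` helper is the payoff of persisting each crawl: a failed run just picks up at the first month it hasn't stored yet.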
Fairly easy and simple, right? This way, we can build multiple prompts...or even expand it slightly and pull data from multiple distinct APIs if we want. In a production environment, this is really important: building content with AI in particular is tricky and requires a lot of iteration, and some models or platforms might be more specialized for what you're trying to accomplish.
See the full repository of this code here.
Coding the API Pull
The OpenAI API itself is very easy, so we don't need to overthink a lot of the fundamentals here. Again, the main focus isn't on how we build the backend...but how the OpenAI API works. Considering the bold, bold claims from AI mavens that this tech can and will do everything better than conventional software engineering, it's worth actually testing the tech and understanding actual use cases beyond marketing propaganda.
With all this in place, let's ask for a historical summary of the year 1900. We will provide a month and year (e.g. January 1900) and instruct the AI to return 3 events that happened in the world in that month and year, returning those short summaries as JSON. Let's look at the results!
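Roughly, the pull looks like this. This is a Python sketch for illustration -- the real backend does this in Laravel -- using OpenAI's chat completions endpoint; the prompt wording and model name are assumptions:

```python
import json
import urllib.request

# Illustrative system prompt -- the project's actual prompt lives in a
# SystemRole record.
SYSTEM_PROMPT = (
    "You are a historian. Given a month and year, return 3 notable "
    "events from world history for that month and year as JSON."
)

def build_request(month: int, year: int, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble the chat completions payload for one month/year pull."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{month}/{year}"},
        ],
    }

def fetch_events(month: int, year: int, api_key: str) -> dict:
    """Call the API and parse the JSON the model returns as its message."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(build_request(month, year)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The model's JSON arrives as a string inside the first choice.
    return json.loads(body["choices"][0]["message"]["content"])
```

Each parsed response then gets decomposed into a History row for the current DataRun.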
1/1900
In January 1900, the Boxer Rebellion in China came to an end as foreign powers defeated the Boxer rebels and officially lifted the siege of the legations in Beijing.
In January 1900, the Commonwealth of Australia was established through the federation of six separate colonies, marking the birth of the modern nation of Australia.
In January 1900, the renowned art critic and social thinker John Ruskin passed away, leaving behind a legacy of influential writings on art and society.
2/1900
The Boxer Rebellion, an anti-imperialist uprising in China, began in response to foreign influence and presence in the country, leading to a violent conflict between the Boxers and foreign powers.
The Second Boer War, fought between the British Empire and the Boer states in South Africa, intensified as British forces laid siege to the Boer capitals of Pretoria and Bloemfontein.
Prince Imperial Louis Napoleon, the only child of Emperor Napoleon III of France, died in battle during the Anglo-Zulu War, bringing an end to the hopes of the Bonaparte dynasty to restore the French Empire.
3/1900
In March 1900, a secret Chinese society known as the Boxers launched a violent uprising against foreigners and Christians in China, leading to a siege of the foreign legations in Beijing.
In March 1900, during the Second Boer War, British forces under Lord Roberts captured the strategic city of Bloemfontein in the Orange Free State, significantly impacting the course of the war.
In March 1900, the Gold Standard Act was signed into law in the United States, officially establishing gold as the only standard for redeeming paper money, a crucial economic policy during the period.
4/1900
The Boxer Rebellion, an anti-imperialist uprising in China, begins with the killing of foreign missionaries and Chinese Christians.
The Second Boer War, fought between the British Empire and the Boers of the South African Republic and the Orange Free State, officially ends with the signing of the Treaty of Vereeniging.
The Exposition Universelle, a world's fair held in Paris to celebrate the achievements of the past century and to accelerate development into the next, opens to the public.
As you can see, ChatGPT is obsessed with the Boxer Rebellion and believes it ended in January, began in February, began again in March, and began again in April.
What is going on? First, the model I'm using is GPT 3.5, less advanced than GPT-4. Omitting the word "world" in "world history" doesn't stop the bot's obsession with the Boxer rebellion. If I run a similar prompt it claims that the Boxer rebellion began and ended every month in 1900 from January to October!
This is because, as so many companies are finding out the hard way, AI is not really an ideal content store. Let's try this again, but with the latest model that sports all those trillion+ params. Surely it will perform much better!
GPT4 vs GPT3.5 as a "historian"
The first thing I notice is that the prompt isn't specific enough when it comes to shaping the output. GPT 3.5 almost always returned a consistent enough format to code against, but switching the model changes the shape of the JSON, so I need to be more detailed about the output I want. This is easy enough to do by adding a caveat like this to the role:
Use the following example to output the result as valid JSON: { events: [ { title: "title", description: "description" } ]}
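Even with that caveat, the model can still drift from the requested shape, so it's worth validating the JSON before persisting anything. A minimal Python sketch (the actual backend would do this in PHP; the helper name is hypothetical):

```python
import json

def parse_events(raw: str) -> list:
    """Validate the expected {"events": [{"title", "description"}]} shape,
    raising if the model drifted from the requested format."""
    data = json.loads(raw)
    events = data.get("events")
    if not isinstance(events, list):
        raise ValueError("missing 'events' array")
    for event in events:
        # dict.keys() supports set comparison: require both fields.
        if not {"title", "description"} <= event.keys():
            raise ValueError(f"malformed event: {event}")
    return events
```

Failing loudly here is the point: a malformed response should halt the DataRun (which we can resume) rather than store half-shaped data.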
Will the model still be obsessed with the Boxer rebellion? Will it still get dates of events "almost right" (often off by exactly a month, strangely)? Let's look at a few entries:
1/1900
In January 1900, the Boxer Rebellion, a violent anti-foreign and anti-Christian uprising, intensified in the Northern provinces of China. This rebellion initiated by the secret society, the Righteous Harmony Society, was against foreign imperialism and Christian missionary activity in China.
Queen Victoria celebrated her 63rd Anniversary as British Sovereign in January 1900, an occasion marked as the Diamond Jubilee. It marked a significant milestone for Queen Victoria as she was the longest-serving monarch in British history until being surpassed by Queen Elizabeth II in 2015.
In January 1900, the Imperial British Navy commissioned the U-1, the first German submarine. This marked a significant development in naval warfare technology.
2/1900
On February 5, 1900, the U.S. and Great Britain signed this treaty. It was an agreement which provided the United States with the exclusive right to create and control a canal across the Central American isthmus to connect the Pacific and Atlantic Oceans.
The British Labour Party was founded on February 27, 1900. Initially a minor party, it grew into one of the most powerful political parties in the United Kingdom, shaping the country's social, political and economic agenda throughout the 20th century.
The Battle of Vaal Krantz took place from February 5 to February 7, 1900, during the Second Boer War. Despite it being a British offensive, the battle ended inconclusively and was another setback for the British in the early stages of the war.
3/1900
In March 1900, Sigmund Freud's 'The Interpretation of Dreams' first edition is published in the United States. It introduces psychoanalysis to the American audience and the world, revolutionizing how we perceive and interpret dreams and the unconscious mind.
In March 1900, a number of German steel companies merge to create the German Steel Trust. This event is significant as it marks the beginning of the industrial consolidation in Europe leading up to World War I.
Representatives of the Trades Union Congress establishes the Labour Representation Committee in March 1900, setting the foundations for what would later be renamed the British Labour Party. This event marks a critical moment in British politics as the party became one of the UK's two main political parties.
4/1900
On April 11th, 1900, the U.S. Navy purchased its very first submarine: the aptly named USS Holland. Named after its designer, John Philip Holland, this breakthrough vessel set the precedent for underwater naval warfare.
The world-renowned Paris Exposition of 1900 opened its doors on April 14th. The event showcased progress and invention, attracting almost 50 million visitors and featuring notable attractions such as the Grande Roue de Paris Ferris wheel and the first talking films.
On April 30, 1900, President William McKinley signed the Hawaii Organic Act, officially making Hawaii a territory of the United States. This set the stage for Hawaii's eventual statehood in 1959.
5/1900
The second Olympics of the modern era began in Paris, France on May 20, 1900. It was held as a part of the 1900 World's Fair.
In China, in May 1900, the violent anti-foreign and anti-colonial Boxer Rebellion intensified. It led to the siege of foreign legations in Beijing that commenced in June.
The Irish Literary Theatre in Dublin, the forerunner of the famed Abbey Theatre, opened on May 10, 1900. It played a key role in the Irish cultural renaissance in the early 20th century.
6/1900
On June 14, 1900, Hawaii officially became a territory of the United States. This development ended Hawaii's independence and marked a significant milestone in the US's imperialistic expansion.
Beginning in the late months of 1899 and escalating in June of 1900, the Boxer Rebellion was an anti-imperialist uprising against foreign influence in areas such as trade, politics, religion and technology in China.
The SS Deutschland, the first ocean liner of the North German Lloyd line to hold the Blue Riband, was launched on June 10, 1900. The SS Deutschland set new standards for passenger travel, eventually contributing to a boom in transatlantic crossings.
At first glance, this looks more reasonable. We still have more than one entry about the Boxer Rebellion, but with more context -- it no longer makes claims about the conflict starting in each month.
However, let's spot-check some entries. For example, the second Olympics of the modern era began on May 14th, not May 20th. It was indeed held as part of the World's Fair (which was at the time a much bigger deal than the Olympics). "Close" isn't good enough when we're dealing with historic facts, is it?
It also claims that Hawaii's independence ended in June of 1900, which is a lot less "almost there" (the US actually annexed Hawaii in August of 1898; June 1900 is merely when the territorial government took effect). Similarly, Queen Victoria's Diamond Jubilee wasn't in 1900, it was in 1897. Get the idea?
These dates are still wrong, even though I explicitly ask the model to verify that the dates of the event are accurate as part of the step-by-step breakdown. I could take the prompt engineering side of this further, but let's step back for a second and think about this.
With AI, Use Case is Everything
With a basic understanding of LLMs, you can see why they do poorly with exact dates and times of historical events. This use case is kind of "asking for trouble" because of how an LLM works and its propensity to hallucinate facts, especially if you're trying to do something specific like this.
This is why you have to ignore the hype offered by people who sell AI products. Yes, Nvidia's CEO thinks coding is dead...but he's trying to sell a product, not provide science. To be clear, Jensen Huang isn't a computer scientist; he's an electrical engineer. His perspective on coding isn't exactly grounded in experience, but in commerce and marketing.
Yes, AI will be revolutionary (in part because of investor inertia and hype regardless of use case), because the LLM is a "natural language specialist"...but you have to think about what that really means in the context of your use case. In broad terms, technology isn't convergent as some like to claim. There's still such a thing as radios, and vinyl today is more popular than it was twenty years ago. Printed books won't go away despite the abundance and convenience of ebooks. Tech tends to be divergent like this. The idea that AI will become a one-stop place where innovators can input their application idea and get something useful spat back out is...well...unlikely, especially within their 5 year timeframe.
Technological refinement isn't an inevitability. People expected self-driving cars to materialize a lot faster than they have, for instance, and it's entirely plausible that this tech will need another 10 or 20 years to bake before you can simply auto-taxi your way across the nation. Just because LLMs are advancing by leaps and bounds today doesn't mean the inherent issues with this technology are guaranteed to be solved, never mind within 5 years.
Coding is a good use case for LLMs, though, because coding is ultimately about language. It's a lot more similar to writing a novel than doing math, so it isn't unreasonable to claim that a smart enough LLM can manage this with some level of reliability...but never with complete reliability, because that's the nature of deep learning networks. They are inherently random and no one can truly predict the output.
The best use cases for LLMs will be in applications where subjectivity is expected. For example, asking the AI to summarize text, analyze the sentiment of customer messages, or write a short bit of descriptive text...these are things it can excel at as a natural language specialist! Asking it to find a specific answer? Or treating it like a search engine that can return facts based on the prompts given? That's far less likely to work.
Technology Adoption Has Little to do with its Quality
For all the people who talk about how LLMs are "good, but not completely ready", there's some firm that deploys a chatbot regardless...then ends up on the hook when that bot conjures a return policy that favors the customer. And they should be on the hook, because if you decide to deploy an LLM for customer service, no one is going to be sympathetic when it goes awry.
As of today, there is no way to guarantee that any LLM behaves exactly the way you want. Remember that scientists are still trying to understand why synthetic neural nets even work. This is one tech where we know exactly how something works without knowing why, and that uncertainty is something AI firms want to sweep under the rug, but in a sense, it means no one can ever be entirely confident that a neural net will behave in a deterministic manner...at least out of the box.
When someone claims that an AI is demonstrating emergent properties, you can annoy them by asking them to "prove it", because there's no way they really can. That's not to say it isn't, but with our current level of understanding, we can't do better than "because it seems like it". It might still be impressive and isn't anything to scoff at, but it's almost impossible to prove without science fully understanding why neural nets work.
Regardless, firms are interested in AI because it will save them money...and history has proven (time and again) that adoption is not about objective choices. Ready or not, people will deploy AI, often in use cases where it's a liability for the company.
For example, Kickstarter bought into the Blockchain craze (because of an investor), claiming they would move their infra to Blockchain because "it was the future" yada yada. It never happened because it was a bad idea and not actually useful for hosting a platform. It's no surprise that this announcement sounds like an advert for blockchain...but now we know that the investment they received never required them to actually implement the tech, so they didn't. Because it was a bad idea.
Yet...from the outside, it might make people think "Kickstarter is doing this, we need to follow". This "caravan effect" in corporate culture is very strong, and some believe it's a driving force behind recent tech layoffs even as companies post amazing profits.
Next Steps for This Project
Now that I've ranted for a while, how do we move forward with this project? We could further refine our prompt to try to force it into something more accurate. For example, one approach is to ask it to summarize a year more broadly and provide each month's breakdown in the JSON -- this would help prevent it from repeating events. However, this doesn't stop it from hallucinating. When I ask GPT-4 to simply provide some important events that happened sometime in 1900 (without requiring an exact date in the description), it still fails. LLMs suck as a "knowledge engine", so this still isn't a huge surprise...but I did expect a bit better, to be honest.
We could also try different permutations in breaking this prompt down step-by-step in an effort to improve it. Perhaps there are some "magic keywords" that force the model to co-operate, but even if that were true, you'd need to audit every bit of content to actually publish anything, wouldn't you?
We could also investigate prompt chaining -- for example, by asking it to "check its work". Or we could look into new models and new platforms, too! Maybe another LLM is better tuned to this specific use case than GPT-4.
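A chained "check your work" pass can be sketched in a few lines. Here `complete` stands in for whatever function sends a prompt to the model and returns its text; the function names and prompt wording are all hypothetical:

```python
# Minimal prompt-chaining sketch: feed the first response back to the
# model with a verification instruction. Names are illustrative.

VERIFY_PROMPT = (
    "Review the following JSON of historical events. Remove or correct "
    "any event whose date is wrong, and return the corrected JSON:\n\n"
)

def chained_events(question: str, complete) -> str:
    """Two-step chain: draft the events, then ask the model to audit them."""
    draft = complete(question)
    return complete(VERIFY_PROMPT + draft)
```

Of course, this just asks the same fallible model to grade its own homework, which is why it's an experiment rather than a fix.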
In reality, though, LLMs are just not good as a means of creating content like this -- a better option would be to obtain a database of events that happened in 1900 and ask the AI to build custom summaries specific to our use case. That's a good fit because it deals mostly in the realm of natural language processing, where the model can excel.
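Under that approach, the model never supplies the facts -- we do. A minimal sketch of building such a prompt from a trusted event store (the helper name and event fields are illustrative):

```python
# Sketch of the "trusted database" approach: the facts come from our own
# store, and the model only does the natural-language work.

def summary_prompt(events: list) -> str:
    """Build a prompt asking the model to write prose for events we
    already know to be true, rather than asking it to recall facts."""
    lines = "\n".join(f"- {e['date']}: {e['fact']}" for e in events)
    return (
        "Write a short, engaging summary for each of the following "
        "verified historical events. Do not add any events or dates "
        "of your own:\n" + lines
    )
```

The "do not add" instruction still isn't a guarantee, but since every date in the output can be diffed against our own database, auditing becomes mechanical instead of manual.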
That's just the start, though, because we can also use embeddings or fine-tuning to see if we can give the model more context or tune it to better fit our needs. We will look into that later, because this article is long enough already. I'm particularly interested in how much context we can provide using custom vector datasets via embeddings.
Overall, the "magic" here isn't so much that the bot returns accurate data (because it doesn't). The "magic" is that it understands what I want just by natural language specification. That is the fantastic promise of LLMs, that they have the ability to understand my intent and act on it. Basically, the ability to "talk" directly to the computer without the lovable middle layer called a programmer.
That said, there's a (sometimes massive) gap between understanding what you want and the capacity to deliver it, and that gap is going to be very important as this technology is deployed. This is especially true for the vast majority of companies that can't afford to train bespoke deep learning networks for their use case and must rely on tuning pretrained models to stay cost-effective.
In case you missed it, you can find the source code for this project here.