
Engineering with AI 4: Generative Content

By Eric Koyanagi

We learned an obvious lesson in our previous article -- LLMs are very bad as authoritative sources of structured data, especially precise data like dates and times. Given how LLMs work, this isn't a huge shock, but it stands in stark contrast to the wild promises erupting across popular media. This is worth revisiting with fine-tuning, where we can enrich the model with labeled data (e.g. a database of historic events), but let's set that aside for now and explore a use case that better fits the nature of the tech.

Some Cleanup

We can use our existing project to create content in much the same way as before. The core idea is the same: query ChatGPT, save the results, and use them to bake a static page we can upload anywhere.

I've done some minor refactoring to fix the low-level looping logic in our GetHistory command, delegating more responsibility to the strategy, as it should be. This way, we can explore an entirely new use case while keeping the core backend. With this in place (see the repo here), we can write a new strategy and explore a more reasonable use case.

The Goal

The concept is similar: we want to use ChatGPT to create a static page with instructive content about history. This time, however, we'll prompt it more carefully. It's also time to explore the world of multimodal AI. For now, we're going to aim for the following:

  1. We feed the model a short description of a historical event that we hand-pick and seed into the DB. This eliminates any issues with date hallucinations, which are almost guaranteed otherwise (seriously, GPT-4 is very happy to make up dates).
  2. We ask the model to provide a longer description of the event. We ask the model to write a short poem about the event, too...that has no chance of going wrong, right?
  3. Finally, we'll use the text-to-speech abilities to have the AI perform the poem, too.

If all goes well, we can iterate on the prompt (using the backend structure we already have) and eventually construct a static page using the content created. A bonus step would be to also create images we can use as backgrounds, e.g. by asking it to "paint" a scene depicting the described event.

This is more likely to be successful because we are playing to the LLM's strengths as a natural language processor rather than trying to treat it as a knowledge store. Hallucinations are still probable, I would guess...especially if we feed it very high-level prompts like "summarize the Battle of Gettysburg."

Tuning the Prompt

The role will look something like this:

You are a historian. You will be given some detail about a historic event. 
Step 1: Create a detailed description of this event in history. This description should be at least three paragraphs long.
Step 2: Create a poem that describes this event in a consistent style. The poem should be at least three sentences long. 
Return the results as valid JSON following this example:
{ name: "", description: "", poem: "" }

When fed a short description of the "Battle of Gettysburg", it seems to do a good job providing a more detailed summary and an artistic poem. Here's a snippet of the "creative work" it authored:

Gettysburg echoes, its story still unfurls,
Of a time when conflict swept the world.
It stands as a testament, grim and grand,
Of deep wounds carved into a nation's land.

The tone does fit the event, at least. Now we're in the realm of the LLM's strengths, and it's absolutely valid to both fear and loathe this idea while simultaneously being enthralled by the potential. I think this duality is healthy: we should remain skeptical about the bold claims from AI marketers (and about how this tech will impact society) while staying curious about the potential. Now let's do this at more scale, creating content for a series of historical events and making the model "perform" the poem. You can probably also see the flaw: if we prompted it with a lie about when Gettysburg took place, I expect it'd happily repeat that lie, since it's largely ignorant of details like "exactly when something happened."

With how LLMs work, there's a good reason that prompting with some background facts about the event works: it "weights" the model and pulls more relevant words out of its multi-dimensional vector space. This is basically how the New York Times "forced" GPT to return word-for-word excerpts of their copyrighted articles -- they prompted it with a few sentences from an article, and the model took over. So while this might coax the model into returning better data...the implications for copyright are not entirely clear. The more specifically you prompt, the better...but does that then expose you to copyright issues? Who knows! This is the bleeding edge of copyright law right now...and history has shown that bleeding-edge tech and copyright law have a, uh, tense relationship.

Seeding the Data

We're going to try this again, but we aren't going to give the AI easy events like Gettysburg, where so much has been written about the topic. Again, we're going to focus on events that happened in the year 1900. With the backend already built, we can create a new model for this specific use case, relate it to the existing "DataRun", and basically be done with most of the work. All we really need is a new strategy, at least for the initial data seed.

This time, we also need to have another layer of integration where we use the results of our chat completion to create a spoken poem...which, why am I subjecting the world to more spoken word poetry? I don't know, evil works in mysterious ways.

This is a simple example of prompt chaining, where we feed the results of one API call into another. This can be useful to validate answers or process a response further...or in our case, to drive multimodal output. The backend structure could get complex with more scaled-out applications like chatbots, and as mentioned before, it's only a matter of time before WYSIWYG "behavior tree" designers roll out.
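As a concrete (if toy) sketch, here's a two-step chain in Laravel. The helper name and config key are inventions for illustration, but the payload shape is OpenAI's chat completion API:

use Illuminate\Support\Facades\Http;

// One round trip to the chat completion API.
function chatOnce(string $system, string $user): string
{
    return Http::withToken(config('services.openai.key'))
        ->post('https://api.openai.com/v1/chat/completions', [
            'model' => 'gpt-4',
            'messages' => [
                ['role' => 'system', 'content' => $system],
                ['role' => 'user', 'content' => $user],
            ],
        ])->json('choices.0.message.content');
}

// Chain: the first call drafts content, the second critiques the draft.
$eventSummary = 'On July 29, 1900, King Umberto I of Italy was assassinated in Monza.';
$draft = chatOnce('You are a historian.', $eventSummary);
$review = chatOnce('You are a fact checker. List any inaccuracies.', $draft);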

But how do we actually seed this data? Well, we have to go research events from the year 1900, learn about them, and write short text descriptions that nudge the AI into crafting something accurate. We're careful to include the date of the event in this prompt, since we know the model sucks at knowing when an event happened if you don't help it along.
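As a sketch, a seed record might look like this -- HistoryLibrary is a hypothetical model name, but the fields mirror the $library->event and $library->image_prompt accessors used in the next section (the repo's actual schema may differ):

use App\Models\HistoryLibrary;
use Illuminate\Database\Seeder;

class HistoryLibrarySeeder extends Seeder
{
    public function run(): void
    {
        // A hand-researched summary with the exact date baked in, so the
        // model never has to guess at chronology.
        HistoryLibrary::create([
            'event' => 'On July 29, 1900, King Umberto I of Italy was '
                . 'assassinated in Monza by the anarchist Gaetano Bresci.',
            'image_prompt' => 'A somber turn-of-the-century oil painting of Monza, Italy',
        ]);
    }
}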

Working with the Data

Most of the logic can be summarized with this snippet:

// Query the chat completion API, feeding in our hand-written event summary.
$data = $this->getData($library->event);

// The model returns "name###description###poem"; split on the delimiter.
list($name, $description, $poem) = explode('###', $data["choices"][0]["message"]["content"]);

// Persist the generated content, tied to the current DataRun.
$poemRecord = HistoryPoem::create([
    'name' => $name,
    'description' => $description,
    'poem' => $poem,
    'run_id' => $run->id,
]);

// Chain the poem into text-to-speech and the image prompt into Dall-E.
$this->saveAudioData($poem, $poemRecord->id);
$this->saveImageData($library->image_prompt, $poemRecord->id);

Gross, using the "list" function and exploding a string...? Well, GPT-4 has trouble consistently returning valid JSON no matter how you phrase the request. Sometimes it worked; usually it didn't. That's likely because I ask for a long block of text (three paragraphs)...good 'ole delimiters are more reliable. So instead of asking it to produce valid JSON with specific keys, I tell it:

Return the name, description, and poem, delimited by "###". For example: "name###description###poem"

This code is not too complex. First, we use getData, which simply calls the chat completion API. It feeds in the role we define in our schema along with all the tuning params like max_tokens and frequency penalties. The prompt is manually pre-written and loaded from the database ($library->event). In other words, we "get the AI started" by feeding it a short, accurate summary of some historic event -- which (ideally) tunes it to return useful, factual information. We'll see about that later.
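For illustration, here's roughly what getData boils down to -- a minimal sketch assuming Laravel's Http client, with the tuning params hard-coded instead of loaded from the schema as the repo does:

use Illuminate\Support\Facades\Http;

private function getData(string $event): array
{
    return Http::withToken(config('services.openai.key'))
        ->post('https://api.openai.com/v1/chat/completions', [
            'model' => 'gpt-4',
            'messages' => [
                // The role prompt from earlier, plus our seeded summary.
                ['role' => 'system', 'content' => $this->role],
                ['role' => 'user', 'content' => $event],
            ],
            'max_tokens' => 1500,
            'temperature' => 0.8,
            'frequency_penalty' => 0.3,
        ])->json();
}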

After separating the fields, we persist the data to the database and create an MP3 based on the "poem" created in the first step. This API is shaped almost identically to the completions API, so it makes for an easy addition to our service layer.
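Here's a minimal sketch of saveAudioData under that assumption -- the /v1/audio/speech endpoint returns raw MP3 bytes, so we just write them to disk (the storage path is made up):

use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Storage;

private function saveAudioData(string $poem, int $poemId): void
{
    // Same shape as a completion call, but the response is binary audio.
    $audio = Http::withToken(config('services.openai.key'))
        ->post('https://api.openai.com/v1/audio/speech', [
            'model' => 'tts-1',
            'voice' => 'alloy',
            'input' => $poem,
        ])->body();

    Storage::put("poems/{$poemId}.mp3", $audio);
}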

Is it right...? Can I finally build a page...?

Oh, boy, now we get to grade the robot! Just to reiterate how miserably bad ChatGPT is at understanding dates, I asked GPT (3.5) to check the description for inaccuracies. Some of the gems it came back with include:

  • King Umberto I was actually assassinated on July 29, 1900, not July 29, 1900
  • However, King Umberto I was assassinated in Monza, not Monza.
  • However, King Umberto I began his reign in 1878, not 1878.

I really hope those stories about teachers using GPT to grade homework don't include history teachers! Understanding a seemingly simple concept like equality isn't its main strength.

My example event was the assassination of King Umberto I of Italy in the summer of 1900. Fed the exact date of the event in our prompt, the model did better with chronology and correctly stated the king's year of birth and the year his reign started.

GPT-4's description of Umberto's assassination is now correct (a big improvement over our last article), but still problematic. Here's one snippet:

His reign was marked by a series of radical reforms, some of which were deeply controversial among various factions of the Italian society. His approach to governance, though impactful, had earned him a considerable number of detractors - one of whom was Gaetano Bresci.

You might get the impression that Bresci was a radical anarchist who murdered the king simply because he was a political extremist. It would be worth mentioning that this was the third assassination attempt on the king.

Also, the text mentions "controversial" reforms...but that's very vague, and it ends up being too kind to Umberto when trying to understand the context of the assassination, not that we're trying to justify murder or anything. Italy was under martial law in 1898, when rising bread prices led to insurrection in Milan. General Fiorenzo Bava Beccaris restored order, harshly, slaying anywhere from 80 to 400 civilians in an event appropriately called the Bava Beccaris massacre.

Umberto I celebrated Beccaris for his actions, decorating him for merit and making him a senator...he then appointed another military general, Luigi Pelloux, as prime minister. In 1899, Pelloux suspended parliament and decided it was better to pass laws unilaterally by royal decree. While this was overturned in 1900 (with Umberto then pledging to reverse course), it's hard to argue this wasn't an attempt to form a dictatorship. Umberto was a man so singularly focused on his military education and so disinterested in intellectual pursuits that he found writing "too mentally taxing" and dictated everything. He very much admired the Germanic military tradition and, as part of the Triple Alliance, had even been advised to ditch parliament and form a dictatorship.

I'm not a historian, but all this context is surely useful for understanding why Bresci assassinated the man...and even though it's never good to "celebrate" a death, an overly mournful, tragic tone that vilifies Bresci and celebrates "the good king" without more context creates the wrong impression about this slice of history.

This is likely part of how LLMs work, too. When you're dealing with a murder, the "vector word cloud" probably wants to pull words like "tragic" because the bulk of human writing tends to describe murders like that. When dealing with the death of a monarch, it wants to talk about subsequent "turmoil", because that's typically how these things go. It's easy to see how this can erase nuance and context, though.

We can tune this behavior by reducing the temperature or encouraging it to be more specific. We can also try multiple attempts with the same prompt -- the deep nets are inherently random, so some "passes" might be better than others.
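Reusing the hypothetical getData sketched above, the brute-force version of "multiple passes" is just a loop:

// The model is stochastic, so identical prompts yield different output.
// Generate a few candidates and keep whichever reads best.
$candidates = [];
for ($i = 0; $i < 3; $i++) {
    $candidates[] = $this->getData($library->event);
}
// Review the candidates by hand, or with another validation prompt.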

All this being said, nothing GPT said was "wrong" as far as I can tell. That's an improvement compared to its obsessions about the Boxer Rebellion. By being more meticulous with our prompt, we're constraining its creativity and hopefully reducing the likelihood of hallucinations, because we're augmenting the probability model. It's great that GPT guides mention that "being more specific is better", but remembering why that's the case helps us when we tune prompts.

As for the poems, they have the same issues as the description. The Umberto poem mentions "discontent" and "reforms" very vaguely, then talks about how the "nation gasped in horror" and how the killing "ended one era and birthing one of unrest". Eh, come on, that's not really fair. It's not like Umberto's death caused WWI.

To be realistic, every poem it churns out is guaranteed to be horribly cringe-worthy and hilarious. I intend to ask it to make a poem about the Gold Standard Act of 1900 and can't wait to see the overly-serious result. It's too bad I'm committed to subjecting the world to this.

Adjusting Prompt & Temperature, Adding Dall-E-3

I tried this again, this time specifically mentioning some of the context I've summarized above. The AI is still a bit too friendly, though:

The severity of the situation in Italy compelled King Umberto to suspend parliament in 1899, a move that made him both a defender of monarchy and a target of revolution.

Yeah, let's...not describe him as "compelled" by unrest to suspend parliament. It's very difficult to explain this nuance to the LLM: it must lay out the historical circumstances of a murder while keeping a more neutral tone toward the victim.

By using a cooler temperature for the summary, we might avoid some of this colorful language, which isn't especially helpful here. I'll also explicitly ask it to adopt a neutral, factual style.
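In terms of the request payload sketched earlier, that's two small knobs ($payload here is hypothetical shorthand for the request body):

// Cool the sampling so word choice is less flowery...
$payload['temperature'] = 0.2; // down from ~0.8

// ...and amend the role prompt so the model stops editorializing.
$payload['messages'][0]['content'] .=
    ' Write in a neutral, factual style. Do not editorialize.';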

This has moderately better success (after a few iterations). This time, it mentions the massacre and the prior assassination attempts, and presents the social unrest in a more neutral way instead of seeming to make excuses for what some would insist was a coup. I could actually accept its description for this! Is it great? Not entirely, but hey, we expected it to be a C-level student, and it finally delivered.

Of course, LLMs are inherently stochastic so if we try the same prompt again, it might be more (or less) correct, complete, or fair.

Since I'm having fun, why not throw image generation into this, too? We'll call Dall-E-3 and pass it a specific image prompt we author -- we could feed it the same description of the event we use to drive text generation, but I expect we'll need more granularity with images. Image generation is tricky, especially (as Google learned) when dealing with history...so we might opt to abandon this in the final product.
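Here's a sketch of saveImageData, assuming OpenAI's /v1/images/generations endpoint with the dall-e-3 model -- by default it returns a URL, which we mirror locally (the storage path is made up):

use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Storage;

private function saveImageData(string $imagePrompt, int $poemId): void
{
    $url = Http::withToken(config('services.openai.key'))
        ->post('https://api.openai.com/v1/images/generations', [
            'model' => 'dall-e-3',
            'prompt' => $imagePrompt,
            'size' => '1792x1024', // wide, suits a page background
            'n' => 1,
        ])->json('data.0.url');

    Storage::put("backgrounds/{$poemId}.png", Http::get($url)->body());
}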

The Final Page: Exploring the Year 1900

First, we need to build out more events, which means becoming a mini-expert on a few events from the year 1900...because the AI can't get there on its own, as we've learned. That means hand-picking content and engineering each summary to guide the AI toward the right tone and content. We're asking for three paragraphs of detail, but providing no more than two sentences to help it along (plus our role prompt, of course).

This is why generative AI is most effective in the hands of people who are experts in the content you're trying to create. The idea isn't that it's a knowledge store where any 'ole person can instantly snap together content of any specialty just by asking for it...it isn't there yet. You need to help it along, seeding its predictions with real facts and potentially iterating multiple times to get something workable. It isn't entirely autopilot yet, even with GPT-4 and its possibly human-scale neural net.

It won't ever be a "science", either, because it involves two skill sets that are (unfortunately) viewed as somewhat divergent in many circles: language and computer science. With prompt engineering, the words you pick matter. It helps to understand why they matter, too.

Here is the final page that showcases everything we've done to get to this point. It includes text descriptions of 5 events from the year 1900, with image backgrounds created by AI and shitty poems written and read aloud by AI. You can see the source code for the project in general here.

Conclusion

Even for a simple page, we had to do real work to get here. Yes, we could have built a simpler backend and handled a lot of this static content manually, but even then it takes work to tune the AI. Generative content is very powerful, but it works best in the hands of people who understand that content, which is (hopefully) obvious by now.

It might seem like a lot of work for the end product, but it wouldn't be hard to add more events or build pages for other years now that everything is in place. We could produce text, audio, and image content at a much faster clip than any one human could manage, even with the research we have to do ourselves. If we had a database of valid data already, we could scale this massively.

LLMs are swimming in hype right now, but like any tool, they have strengths and weaknesses. There is a feeling of magic as the power of NLP allows the machine to "understand" what you want, but you still need to do your part to help it conjure the right result.

We see how quickly this tech evolves, but we don't really know whether the limits it has today are solvable. Tech isn't "inevitable" no matter what Nvidia's CEO says -- maybe in five years generative AI really will be good enough to replace us humble software engineers...or maybe it'll take ten times that long, or prove impossible altogether.

The future is unwritten...unless you ask ChatGPT. It'll happily make things up.

