
Engineering with AI 6: Making a knowledge base

By Eric Koyanagi

As we've learned, LLMs work best when "conditioned" to spit out the data we want by feeding them very specific input that tells them what to do. That seems logical enough, right? This article looks into using the (as of this writing, still-in-beta) assistants API and augmenting it with "knowledge".

Joy, New Acronyms...What's a RAG?

Just when you thought computer science had hit peak acronym, we've delved into the world of AI, where we can drown in a fresh wave of them. The ChatGPT definition for RAG is:

RAG (Retrieval-Augmented Generation) in AI is a model architecture that combines the strengths of both retrieval-based and generation-based methods for natural language processing tasks.

In other words, making the LLM less stupid, since it is merely a predictive model trying its best to magic up the answer you want from the input. The NVIDIA blog post on this topic compares the LLM to a judge and RAG to the court clerk, whose duty is to research specific facts. This way, the judge can not only work off their vast knowledge, they can also cite specific cases from authoritative sources. Putting aside the terror of this particular metaphor (no one needs ChatGPT to be a judge, but it would make an interesting reality show), it's a good way to describe the role of RAG in generative AI.

We already learned that LLMs suck at facts, so we have to prompt them carefully. RAG takes this further by augmenting the LLM with specific, factual knowledge, reducing the possibility that it will "make a wrong guess" (we're all calling it hallucinations, for better or worse).

Being "Lazy" and Using the Platform

We've been here before, and anything that's best done yourself can and will become "platformized" at lightning speed! We have a few choices for building knowledge into our LLM without implementing our own RAG, one of which is the assistants API mentioned above (I keep wanting to write "assistance API"...why do those two words have to be basically the same?).

Don't get me wrong, it's worth learning how to build our own RAG (more on this later), but this tech isn't new at this point. The business powers that be know they make more money by adapting their product to common use cases...and leveraging LLMs to "use" repositories of existing knowledge is a massively important use case. Not only that, the API has multimodal capabilities, so it can (again, in theory) churn out files like spreadsheets or PDFs, the lifeblood of oh-so-many businesses.

Assistants can have knowledge attached to them (for a fee). Note some hard limits: you can have at most 20 files, and those files can be up to 512 MB each as of today (I mean when I wrote the article...duh), but each file also caps out at 2,000,000 tokens...so with dense text you'll hit the token cap long before you hit 512 MB! Ideally, we can feed it information and have it spit out a neat little PDF report that has actual facts. Neat...at least in theory.

Another note is that the model automatically chooses to either pass the file content into the prompt or perform a vector search for longer documents, and you can't directly control that behavior. Also worth noting, quoting the docs:

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls
The Assistants API automatically manages the context window such that you never exceed the model's context length

This is one way of saying that this can get expensive with larger files...although that isn't exactly clear in their retrieval pricing section. I guess we'll learn that through experimentation!

Creating a ChatGPT Assistant

This is a fairly simple API operation. We'll expand our Laravel-based backend to be able to handle this. As before, we can just create a new strategy to explore this use case and leverage all the same boilerplate and repo.
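If you've been following along with the earlier posts, that might look something like the sketch below. The interface and method names here are purely hypothetical stand-ins for whatever your existing strategy setup looks like:

// Hypothetical shape only -- swap in whatever strategy interface your backend already uses.
interface CompletionStrategy
{
    public function handle(string $prompt): array;
}

class AssistantKnowledgeStrategy implements CompletionStrategy
{
    public function __construct(private AssistantService $assistantService) {}

    public function handle(string $prompt): array
    {
        // Delegate to the assistants API instead of the plain chat completions endpoint.
        return $this->assistantService->ask($prompt);
    }
}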

It's worth implementing this as a separate service -- this is still a beta feature and is likely to change quickly. The limits described above will likely be relaxed even more over time. It's unclear if this will ever become cheaper, though...probably, but LLMs are expensive and resource-intensive. They aren't "efficient" in the scientific understanding of the word.

The first step is to create the assistant, which we can actually do with the UI in the playground. If you're building a static knowledge base and you already have your files ready-to-go, you can WYSIWYG this step!

Otherwise, programmatic creation is simple. In our Laravel app, it might look something like this:

// Create the assistant. Note that $tools is passed as a plain array (e.g. [['type' => 'retrieval']]);
// Laravel's Http client JSON-encodes the request body for us, so calling json_encode() here
// would double-encode the value and break the request.
$response = Http::withHeaders([
    'Content-Type' => 'application/json',
    'Authorization' => 'Bearer ' . $this->apiKey,
    'OpenAI-Beta' => 'assistants=v1', // required while the assistants API is in beta
])->post($this->baseUrl . 'assistants', [
    'instructions' => $prompt,
    'model' => $model,
    'name' => $name,
    'tools' => $tools,
]);

return $response->json();

For knowledge retrieval, the tools array must include a retrieval entry. Another neat tool is the code interpreter. Per their docs: "Code Interpreter allows your Assistant to run code iteratively to solve challenging code and math problems". Giving the LLM a Python sandbox where it can write and execute code is obviously powerful, especially because LLMs otherwise suck at parsing even simple math problems. Again, the LLM's strength is as a natural language processor, and since code is absolutely a form of language, it can do this better and more effectively than trying to "solve" a math problem purely within its neural net, which wasn't designed for that use case at all. This is why rumors about "Q*" are meaningful, even if entirely speculative -- an LLM capable of "reasoning" with math nearly as well as it can with language might be a huge step forward...if that's what it is, which...I guess we'll find out soon enough, or we won't. That's the joy of tech sometimes.
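For reference, the $tools array from the snippet above might look like this if we enable both retrieval and the code interpreter:

// Each tool is an object with a "type" key, not a bare string.
$tools = [
    ['type' => 'retrieval'],        // lets the assistant search the files we attach
    ['type' => 'code_interpreter'], // gives the assistant a sandbox to write and run code
];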

Back to reality, now to attach our knowledge to the model!

Preparing and Attaching Knowledge to an Assistant

Let's make sure we're clear about one thing: RAG is not training. LLMs like OpenAI's GPT models are pre-trained; there is no way to "train" GPT on your data here. RAG extracts relevant data from your knowledge store based on the prompt (a step called retrieval, simply enough), then augments the prompt with additional information from our documents. This is much closer to prompt engineering than training. By "conditioning" the LLM with this vetted, factual knowledge, we can improve its ability to generate the content we want -- which is the same idea behind prompt engineering. We want to give the model better tools to understand what's factual and important, and based on that context, we can expect the LLM to generate more accurate results.
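To make the retrieve-then-augment idea concrete, a hand-rolled version of the flow might look roughly like this. This is a rough sketch, not what the assistants API does internally, and searchKnowledgeStore() is a hypothetical helper that would run a vector or keyword search over our documents:

// Retrieval: pull the chunks of our documents that are most relevant to the question.
$question = "How much bereavement leave do I get?";
$relevantChunks = $this->searchKnowledgeStore($question); // hypothetical helper

// Augmentation: stuff those chunks into the prompt before it ever reaches the LLM.
$augmentedPrompt = "Answer using only the following excerpts from the employee handbook:\n"
    . implode("\n---\n", $relevantChunks)
    . "\n\nQuestion: " . $question;

// $augmentedPrompt is then sent to the model as an ordinary completion request.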

It's important to understand this because it affects your strategy.

For example, if you have a large quantity of similar documents, you might not need to throw all of them into the knowledge store as you would when actually training a model. You'll need to experiment with this to find the right balance between accuracy and cost. It also matters how you format these files, because as of today the limits are fairly strict -- with only 20 files available, you may need to concatenate content into fewer but larger files, so long as they remain under 2 million tokens (and 512 MB).

We are going to explore a very simple use case: let's build a bot that can answer questions about an employee handbook. This way employees can ask questions in simple terms, and we'll see how robust the result is when loading our LLM with this knowledge. All we need to do is upload one PDF and associate it with our assistant.

The code is fairly boring, but it would look something like this at a high level:

$fileResponse = $this->assistantService->uploadFile("/example/file.pdf");
$this->assistantService->create("Test Assistant", "Test Prompt", ["retrieval"], $fileResponse);
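The uploadFile call just wraps OpenAI's file upload endpoint. A minimal sketch of that service method might look like the following (the method name and error handling are mine, not part of the API):

public function uploadFile(string $path): string
{
    // Upload the document to the /files endpoint with the "assistants" purpose
    // so it can be attached to an assistant later.
    $response = Http::withHeaders([
        'Authorization' => 'Bearer ' . $this->apiKey,
    ])->attach(
        'file', file_get_contents($path), basename($path)
    )->post($this->baseUrl . 'files', [
        'purpose' => 'assistants',
    ]);

    // The response includes an "id" (something like "file-abc123") that we hang
    // onto so we can reference the file when creating the assistant.
    return $response->json('id');
}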

See what I mean? It's boring stuff; we just have to upload files, remember the file IDs returned by OpenAI, and then attach those files when we create the assistant by adding the file_ids parameter to the array above, after 'tools' (or wherever the heck you want to put it).
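That would make the create request from earlier look something like this (same call as before, sketched with the new parameter):

// Same create request as before, now with the uploaded file attached.
$response = Http::withHeaders([
    'Content-Type' => 'application/json',
    'Authorization' => 'Bearer ' . $this->apiKey,
    'OpenAI-Beta' => 'assistants=v1',
])->post($this->baseUrl . 'assistants', [
    'instructions' => $prompt,
    'model' => $model,
    'name' => $name,
    'tools' => [['type' => 'retrieval']],
    'file_ids' => [$fileResponse], // the file ID returned by uploadFile()
]);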

Doing this programmatically could be really useful if you have a complex use case, but in this example...we can exercise this entirely in their UI if we want.

So...how well does it work?

I quizzed the LLM about a few different topics, from dress code to animal policy to time cards. I explicitly tell it to respond with "Sorry, please ask your supervisor as I cannot find an answer to this question" when someone asks something outside the scope of the provided document. This seems to work when I quiz it about subjective topics not covered in the handbook, like "how do I get a raise?"
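The relevant part of the assistant's instructions looks roughly like this (paraphrased -- the exact wording is whatever fits your handbook):

// A paraphrased sketch of the instructions used for this experiment.
$prompt = "You are an HR assistant. Answer questions using only the attached employee handbook. "
    . "If the handbook does not cover the question, respond with: "
    . "\"Sorry, please ask your supervisor as I cannot find an answer to this question.\"";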

Otherwise, it does a great job of sticking to the source documents when quizzed about a few subjects. Retrieval is fairly slow, but we can improve the speed by uploading a plain text file rather than a PDF. This is also more reliable in general because PDFs are a silly, silly document format that should have never existed to begin with...*cough* okay, back on track.

One thing I notice when providing knowledge is that it likes to quote from that knowledge very directly -- spitting back the source file word-for-word sometimes. That's absolutely great for this use case, but it might not be ideal universally.

It does also sometimes summarize and reword things to be more clear.

For example, asked about bereavement leave, it quotes from the handbook directly but adds some context missing in the source text: "This means the actual amount of bereavement leave you’re entitled to might depend on the specific requirements of the state or local laws applicable to your work location."

In this case, the added context helps make things more clear, which is the point of throwing this sort of information into an LLM. This way your employees can ask it plain language questions and get replies back they can hopefully understand, and even follow up with a question to try to get more detail or ask it to rephrase something.

However, it still fails on some things. For example, when I ask it to provide the legal department's contact info, it can't understand that. It does instruct me to "contact legal" if I have a legal issue and provides the file as a reference, but doesn't know how to extract that specific information from one chart in the document.

One simple way to improve this is to have the prompt return a contact number for general questions instead of just telling them "sorry" -- you always want some fallback that will point users to the information they need if the LLM can't understand.

We can also help the LLM along by adding this context into the prompt explicitly, and this is why it's important to iterate and collect feedback. If you realize that the LLM is weak in some area (like understanding contact info), you might be able to aid it by being even more explicit in your prompting. Further, simply converting the doc to text might help as data could be lost to the absurd nature of PDF-land.
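For example, the fallback in the instructions could point people somewhere useful instead of just apologizing. Everything specific here (the contact email and phone number) is a made-up placeholder:

// Sketch of a more helpful fallback; the contact details below are placeholders.
$prompt = "You are an HR assistant. Answer questions using only the attached employee handbook. "
    . "The legal department can be reached at legal@example.com. " // placeholder contact baked into the prompt
    . "If the handbook does not cover the question, direct the employee to call HR at 555-0100 " // placeholder number
    . "instead of simply apologizing.";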

Conclusion and Future Steps

Predictions with AI are a dime a dozen, but I think "knowledge" is one of the most critical issues for LLMs to solve. The fact that the LLM is very good at parsing natural language to understand "intent" is revolutionary, but people are realizing that using something like GPT as a knowledge base doesn't really work, because it doesn't have a concept of knowledge.

Closing that gap is an urgent goal for all the major players in this space -- it's not enough for the LLM to know "what" you want, it has to be able to actually deliver it, which means it must become better at reasoning and knowledge.

In other words, I think a lot of the steps described here will change radically or perhaps even become obsolete. This is such a critical problem to solve! As I like to say, tech isn't "inevitable" -- for all we know, the problem will never be fully solvable, but there will be rapid iteration on this concept, so it's worth exploring the latest APIs and betas.

Writing our own RAG is interesting and I'd love to delve into this topic later, but these platforms will absolutely adapt and expand to improve how this works and make life easier for developers wanting to tap into the power of LLMs.

Check back in a year and this will be "platformized" even more!


