
Making a Universal Web Crawler with AI and RAG

By Eric Koyanagi

LLMs are great, but there's still a massive amount of misunderstanding about what AIs can actually do. Putting aside the hype, LLMs are natural language specialists -- they have an uncanny ability to understand what we want simply via plain text, something computers have never really been great at. As of now, they aren't really ideal as knowledge engines because their focus is on understanding language, not "knowing things".

Making a Universal Web Crawler

There are plenty of legitimate reasons why you would want to programmatically "scrape" a website...granted, the idea is often used for "evil". Knowing we can augment AIs with specific knowledge (a process called RAG -- retrieval-augmented generation), we can think about an AI-powered web crawler that ingests some HTML, splits it into chunks, then creates a vector database to load into the conversation. These basic steps are the same no matter what your content source is.

To make an AI-powered crawler, we'll first make a web request to obtain HTML content:

import requests

url = "https://en.wikipedia.org/wiki/Felix_of_Burgundy"
response = requests.get(url)

Wikipedia is a good place to start (sorry Wikipedia) because it has text-heavy webpages that are ideal for this use case.
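
One quick aside: if you're crawling anything more than a one-off page, it pays to be polite about it. A minimal sketch (the User-Agent string below is just a placeholder -- use your own) that identifies the crawler and bails out early if the request fails:

headers = {"User-Agent": "my-little-crawler/0.1 (you@example.com)"}  # placeholder -- identify yourself
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an error instead of silently parsing an error page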

Splitting the HTML

We can use LangChain's code-aware text splitter to split the returned HTML document. As a basic example:

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([response.text])

That's almost directly from their example, only we supply the response HTML as the document. In case you missed it, the Language enum lives in the text_splitter module (it's stated near the top of their docs, but easy to miss if you're in a hurry):

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
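
Before moving on, it's worth peeking at what the splitter actually produced. A quick sanity check (assuming the code above has run):

print(len(html_docs))             # how many chunks we ended up with
print(html_docs[0].page_content)  # the text of the first chunk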

Now we create the embeddings and load them into a vector store (our vector database), again tapping into LangChain's convenient modules:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(html_docs, embedding=embeddings)

All we have to do is pass our html_docs to FAISS (Facebook AI Similarity Search -- oh how I love branded acronyms!).

Make special note of the chunk_size and chunk_overlap params, because we will need to come back and tune them well beyond the example's values.
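
You can also poke at the vector store directly, without any LLM in the loop, to see which chunks a given question would pull back. A quick sketch using the store's built-in similarity search:

docs = vectorstore.similarity_search("Who was Felix of Burgundy?", k=4)
for doc in docs:
    print(doc.page_content[:200])  # preview the chunks the retriever would hand to the LLM

If the chunks that come back look like random scraps of markup, that's a strong hint the splitter settings need work.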

Before looking at the next step, it's useful to understand a few things about vectors. If you don't remember math class, that's fine (I didn't either) -- if you really want a hands-on understanding of vectors, boot up a game engine like Unity and play around. It'll be more fun and you'll have no choice but to understand vectors inside and out.

Vectors are points in space, but also directions. If I have two vectors, I can create a "pointer" from one to the other simply with subtraction. This is the basis for how embedding models capture the relatedness of words, but "relatedness" can be measured in more than one way.

Maybe you've seen the phrase "cosine similarity" -- that just means that a word's vector "points" in the same direction as another word's vector. With this similarity metric, it doesn't even matter how far one word might be from another...so long as it points in the same direction, it's considered related.

Some don't think this is the best metric, because it uses normalized vectors that ignore magnitude (only direction matters). This is why understanding the underlying concept of a vector is important even if you're "merely" working with a custom RAG -- you might want to experiment with different metrics and see what works best for your use case. At this point there's still a lot of trial and error and subjectivity involved, and that might always be the case. Still, understanding the underlying math can help you make better choices.
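
To make that concrete, here's a toy sketch with plain numpy and made-up 2D vectors (real embeddings have hundreds of dimensions, but the math is the same):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])  # same direction as a, twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
distance = np.linalg.norm(a - b)  # the "pointer" from one vector to the other

print(cosine)    # 1.0 -- identical direction, so maximally "related" by this metric
print(distance)  # ~2.24 -- yet they clearly aren't the same point in space

Cosine similarity says these two are a perfect match; a distance-based metric disagrees. Neither is wrong -- they're just measuring different things.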

Loading the RAG, Asking Questions

Now that we have our handy vectors, we can actually use them for RAG and start asking questions.

from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

llm = ChatOpenAI(temperature=0.7, model_name="gpt-4")
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),  # the vector store becomes the chain's retriever
    memory=memory
)

Obviously we can adjust the temperature or use a different model (and with the way LangChain is implemented, it doesn't even have to be OpenAI), but the key is how we pass the vectorstore in as a retriever. Once we ask the LLM a question, we quickly see a problem:

query = "Summarize everything you know about Felix with at least 1 paragraph"
result = conversation_chain({"question": query})
answer = result["answer"]
print(answer)

This (at least for me) came back saying there wasn't enough information to create a summary! But...why? Surely it knows exactly who we're talking about (even without a surname) and has plenty to draw from, given that we've loaded the entire wiki article via RAG. Welcome to the world of tuning.

To fix this, we need to move away from the chunk size of 60 we borrowed from the example. For our use case, 600 is better (though maybe still not enough), and we also set chunk_overlap to 120. There is very much a push and pull with these values -- larger chunks do better with broad context (asking for a summary just can't work with only 60 characters per chunk), but they can also cause the retriever to miss key details. Similarly, websites with dense content might do better with larger chunks than others.
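
Here's a sketch of where those numbers plug in, rebuilding the splitter, the vector store, and the chain (the retriever's k value -- how many chunks it fetches per question -- is another knob worth experimenting with; the 4 below is just an illustration):

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=600, chunk_overlap=120
)
html_docs = html_splitter.create_documents([response.text])
vectorstore = FAISS.from_documents(html_docs, embedding=embeddings)
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory
)

Re-run the same summary question after rebuilding and compare the answers -- that's really the whole tuning loop.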

Conclusion

With tools like LangChain, it isn't that hard to load custom data into OpenAI's pre-trained models, even if it happens to be HTML yanked off some website thirty seconds ago!

Although it would need a lot more tuning and iteration, a tool like this could even be useful in real life. Consider the rather dismal state of accessibility on many websites. Having a tool that can parse the page, generate a reasonable text summary, then read that summary aloud could be very helpful.

I'm sure that people are already using pipelines like this for less ethical purposes, like vacuuming up some content, using AI to "rephrase" it, then posting it as original. This is a different technique from simply asking AI to conjure up articles, because it "grounds" the predictive engine in something (ostensibly) real, improving the odds that it won't hallucinate embarrassing facts.
