Creating an LLM Chatbot that runs on the Raspberry Pi 5
The Raspberry Pi is a humble little computer famed for its low cost and (with the release of the Pi 5) a surprising amount of power. Thanks to novel quantization techniques, it's possible to run an open-source LLM on a machine with only 8GB of RAM. Will this be perfect? Probably not, but it'll be better than we'd expect from such a tiny device.
How to Get Started: Connecting to the Pi
Once you've booted up your delicious new Pi 5, you need to do a few things to work with the device.
- Grab the RealVNC Viewer (https://www.realvnc.com/en/connect/download/viewer/) if you want to remotely connect to your Pi's desktop from your dev machine.
- Connect your Pi to a keyboard and monitor. Go to Preferences -> Raspberry Pi Configuration from the top-left berry-oriented menu, then click "Interfaces".
- Enable SSH and VNC. If you're one of those purists who will only ever access the device via SSH, you don't need VNC.
- Use an Ethernet cable to connect your Pi directly to your dev machine.
- Connect via SSH with "ssh USER@raspberrypi.local", supplying the username you picked when you set up the machine.
- Want to remote into the desktop GUI? Use VNC. You can use it without logging in, and you can use "raspberrypi.local" as the server address, or the IP if you want.
You don't really need to do any of this, but I prefer doing it this way so that I can quickly iterate on my dev machine if needed.
Oh, and if you do want a hint on which Pi starter kit to buy...I'd suggest getting one with an actual fan. We'll be pushing the Pi to its limits and beyond with this! A passive cooling solution alone might not cut it. I don't really know, that's a hardware issue and therefore not my problem! 😅
The Goal: An LLM Chatbot Running Locally
- Capture audio from a microphone
- Turn that speech into text
- Feed the text into a locally running LLM such as Mistral
- Convert the LLM's response back to speech and play it as audio
There are a lot of steps involved in creating a locally running "speech-to-text-to-speech" bot, and there's really no way the Pi will be able to handle it all alone. We'll see that very quickly as we start toying with the idea of running an LLM on something as small as a Raspberry Pi. For now, this goal is just a general roadmap for what we're trying to do.
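Here's a minimal Python sketch of that data flow, just to make the moving pieces concrete. Every function is a placeholder of my own invention (we haven't chosen speech-to-text, LLM, or text-to-speech tooling yet), so treat it as a roadmap in code rather than a working implementation:

# Rough sketch of the planned pipeline. Every function here is a stand-in;
# the real speech-to-text, LLM, and text-to-speech pieces come later.

def capture_audio() -> bytes:
    """Record a clip from the microphone (library still to be chosen)."""
    return b""  # stand-in: no real audio yet

def speech_to_text(audio: bytes) -> str:
    """Transcribe the audio clip into a text prompt."""
    return "Tell me a little about the 20th century."  # stand-in transcript

def query_llm(prompt: str) -> str:
    """Send the prompt to the locally running LLM and return its reply."""
    return f"(model reply to: {prompt})"  # stand-in reply

def text_to_speech(reply: str) -> None:
    """Convert the reply to audio and play it through a speaker."""
    print(reply)  # stand-in: just print for now

if __name__ == "__main__":
    text_to_speech(query_llm(speech_to_text(capture_audio())))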
Running an LLM on a Raspberry Pi
I'll start with the middle step: given a text prompt, query an open-source LLM running locally on the Pi. Why start here? I have a feeling there won't be room for much else once this is running, and it's really the heart of the concept. Even if I can't build a speech interface, it would still be useful.
Ollama is a great tool for running an LLM locally, and it's perfect for the lazy and inept like me. We can get it running with two freakin' commands on the CLI:
curl -fsSL https://ollama.com/install.sh | sh
ollama run mistral
Ollama is helpful here because it makes swapping models easy, and boy, we'll need that flexibility if we want something so ambitious running on something so humble.
With Mistral running, we can easily ask it a question. I asked it to tell me about the 20th century, and it happily complied, sending the Pi's CPU usage to 100% as it went to work. It only consumed about half the machine's 8GB of memory, but we expected about as much. It took several minutes to churn through the response, yet it delivered a thorough, 10-point reply with a short introduction and summary.
Success...depending on your perspective.
Consider the success here. We've loaded an $80 machine with a 7-billion-parameter LLM capable of answering questions about a massive range of topics. It doesn't need a trillion gigabytes of space, either: the ~4GB model fits easily on the microSD card slotted into this tiny thing, which handled the request without even using a GPU.
I next asked it about disguising cake as real objects, because an inane, unnamed show on that topic was playing in the background. It happily answered. This time, the response was constrained to just four sentences, still maxing out the CPU and chewing up about half the RAM, but the answers were valid if generic. I'd love to follow up with more questions, but I genuinely worry about melting my humble Pi.
Now, a little bit about quantization, the magic sauce that allows the LLM to run on such a tiny machine at all. The basic idea is that it's a compression technique that maps "high precision" values to lower-precision ones. A very basic understanding of how artificial neurons work introduces us to the concept of weights, the learned values that control how strongly signals flow from one neuron to the next. Quantization coerces those numbers from 32-bit floating point values into something like an 8-bit integer (or even a 4-bit integer). As you can imagine, this reduces the precision of the model, but considering that 1 billion parameters stored even as 16-bit floats would consume 2GB, we obviously need it to shove a 7-billion-parameter model into 8GB of RAM.
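If you want to put rough numbers on that, the back-of-the-envelope arithmetic below (weights only, ignoring activations and runtime overhead) shows why 4-bit quantization is what makes a 7-billion-parameter model plausible on an 8GB Pi:

# Back-of-the-envelope memory footprint for the weights of a 7B model.
# This ignores activations, KV cache, and runtime overhead, so the real
# requirements are somewhat higher.
PARAMS = 7_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB")

# fp32: ~28.0 GB  -- hopeless on an 8GB Pi
# fp16: ~14.0 GB  -- still hopeless
# int8:  ~7.0 GB  -- barely squeezes in, with no headroom for anything else
# int4:  ~3.5 GB  -- roughly the ~4GB download we actually see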
This is truly some magical stuff, though!
I wouldn't have imagined that a micro-brain capable of answering such a broad range of questions could fit into a mere ~4GB of RAM, comfortably contained within a device smaller than a piece of toast.
When people talk about the magic of LLMs, things like this deserve just as much of the conversation as enterprise-grade offerings from the likes of OpenAI. Google should be terrified. The future of the search engine could very well be bespoke, offline-capable LLMs free of ads, capable of directly parsing a user's query thanks to the magic of NLP.
How to Switch Models with Ollama
We have many options for improving performance. First and most obvious, we can use more aggressive quantization. Second (and in addition), we can give up on the 7-billion-parameter model and use something smaller. This takes trial and error -- do you use a beefier model with more aggressive quantization, or a lower-parameter model? As with many things generative AI, you'll need to experiment and iterate.
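One low-tech way to iterate is to fire the same one-shot prompt at each candidate tag and time it. The sketch below shells out to the ollama CLI from Python; the tag list is just an example, and any tag you haven't pulled yet will be downloaded first, which skews the timing:

# Crude comparison harness: send the same prompt to several candidate
# models and report how long each one takes on the Pi.
import subprocess
import time

PROMPT = "Tell me a little about the 20th century."
TAGS = ["mistral", "orca-mini"]  # example candidates; swap in your own

for tag in TAGS:
    start = time.time()
    result = subprocess.run(
        ["ollama", "run", tag, PROMPT],
        capture_output=True,
        text=True,
    )
    elapsed = time.time() - start
    words = len(result.stdout.split())
    print(f"{tag}: {elapsed:.0f}s for ~{words} words")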
Another option is to give up on running the LLM directly on the Pi and throw it onto my gaming PC with its fancy Nvidia GPU and ample RAM. We can still fulfill the idea of having an offline LLM...requests within my own network count as offline, yeah? Perhaps we'd have more success if the Pi's resources were dedicated to just three things: converting speech to text, querying the LLM via a POST request to a server on my network, then converting the response back to speech.
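For what it's worth, the Pi-side code for that wouldn't be complicated. Ollama exposes an HTTP API on port 11434 by default, so querying the gaming PC is just a POST to its address instead of localhost (the 192.168.1.50 IP below is made up, and the PC's Ollama has to be configured to listen on the network rather than only on localhost, typically by setting OLLAMA_HOST=0.0.0.0). A minimal sketch:

# Minimal sketch: query an Ollama server elsewhere on the LAN from the Pi.
# Assumes Ollama's default port (11434); the IP address is a placeholder.
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.50:11434/api/generate"

def ask(prompt: str, model: str = "mistral") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask("Tell me a little about the 20th century."))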
But first, we need to understand how to actually experiment with this. We can browse a list of available models here: https://ollama.com/library.
When you click on a model, select the "latest" dropdown to show the variants available. Some models offer 7B, 13B, or 70B versions...and we obviously can't handle 70 billion parameters. This also shows the many different quantization options we can employ, and that's what we're interested in here.
Let's try Llama 2's 7B model, but with aggressive 2-bit quantization.
ollama run llama2:7b-text-q2_K
Again, this pushes our tiny lil' machine to its limits in terms of CPU. It doesn't have a GPU, so it's hard to be shocked here. More notably, the answer was a lot worse. That shouldn't surprise us either; we're doing a lot of rounding at this level of quantization.
Once again, I asked the LLM to tell me a little about 20th century history. This time, it replied "nobody will be offended or anything like that, i was just curious". Okay? Then it proceeded to explain how "most of the 20th Century has been an absolute shambles". It then summarized each decade of the 20th century in exactly the same fashion, with all the energy of a dour emo teenager who definitely didn't study history but sort of gets the gist:
The 1960s were pretty terrible too; mostly due to the fact that people decided it would be funny to kill millions of people because they thought that'd be good entertainment for them - and then we went nuclear on each other for some unexplained reason which no-one quite understands yet but seems likely to cause a lot more pain in the long run.
How drab...and while the snark lands, I get the feeling that this particular student didn't read the textbook. In other words, quantization has its limits. Round too much, and you end up with nonsense.
Let's try loading a specific "mini" model with 3B params and 4-bit quantization.
ollama run orca-mini
Still, it gladly eats all of the Pi's resources...but here we see a good middle ground.
It isn't nearly as unhinged as the heavily rounded 2-bit Llama 2 (7B). It gives us a short paragraph summarizing the 20th century with nothing particularly bad to pluck out and mock. Further, it does it fairly fast -- still slow by any real standard, but fast for what it's doing. There are many other small models we can experiment with.
Have we solved our problem? Sort of. I can say that I have an LLM running on the Pi that gives reasonably good answers and works "reasonably" fast, but we're going to have to take things further in the next article if we want a true chatbot. Also, I can't be confident that the Pi won't melt if the bot were to run for any length of time. It still maxes out the CPU, a lot.
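If you share my melting anxiety, you can at least watch the temperature while the model grinds away. The little watchdog below reads the SoC temperature from the standard Linux thermal sysfs path (the path and the 80-degree "getting toasty" threshold are my assumptions; Raspberry Pi OS also offers vcgencmd measure_temp on the command line):

# Tiny temperature watchdog: print the SoC temperature every few seconds
# so you can see how hot the Pi gets while the model is answering.
import time

THERMAL_PATH = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees C

def soc_temp_celsius() -> float:
    with open(THERMAL_PATH) as f:
        return int(f.read().strip()) / 1000.0

if __name__ == "__main__":
    while True:
        temp = soc_temp_celsius()
        note = "  <-- getting toasty" if temp >= 80.0 else ""
        print(f"{time.strftime('%H:%M:%S')}  {temp:.1f} C{note}")
        time.sleep(5)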
Conclusion
There's not much practical reason to throw an LLM onto a Raspberry Pi -- it's got no GPU, so it'll never be optimized for this task. Still, I was shocked to find that yes, it can do it, and it can do it reasonably well considering what it's actually doing.
Taking a step back, let's at least be amazed that a ~4GB Mistral install can answer a huge range of questions (and do it very well) while running on a sub-$100 machine. We'll continue working on this in a future article.
Yeah, yeah...we could query an API in the cloud and make this work quickly with a snap of our fingers, but that sort of misses the point of how powerful and agile these LLMs have already become. If it can run on a Raspberry Pi, it can run just about anywhere, especially given time. How will this shape a business like Google that very much depends on people turning to the Internet instead of querying a micro-brain built into their phone?
That said, running a query on Google is orders of magnitude more efficient than what I've "accomplished" with a brutish LLM. Asking even the "lean" 3B quantized model a simple question (like "What's the capital of Colombia?") requires the entirety of the machine's CPU...for a while. Fetching that data from a structured source will perhaps always be "more efficient" in terms of energy. In a world where that absolutely does matter, there's a reason people are critical of AI's role in increased energy usage. Playing with these tiny models shows just how resource-hungry they are.
Whether you think AI is a threat or a wonder, realize that it isn't just about making LLMs bigger with massively more parameters and neurons, but also about making them smaller, more efficient, and more portable. A bespoke model trained just to handle narrow customer service questions might be far, far superior to a 100-billion-parameter supermodel that wants to be everything -- especially if it can run on a phone-sized box strapped onto a drive-through or kiosk. Like it or not, that future may not be so far away...and it might look a lot closer to this than to a behemoth cloud-based model, especially as enterprises continue to question whether the cost and promise of the cloud are actually living up to expectations and open-source models continue to prove themselves against the closed-source heavyweights.