Blazing Fast Local Chatbots with Llama.cpp and Gradio 🦙⚡️

Published: 2024-06-25 · Updated: 2024-06-25

In this post, we’ll run a state of the art LLM on your laptop and create a webpage you can use to interact with it. All in about 5 minutes. Seriously!

We’ll be using Llama.cpp’s python bindings to run the LLM on our machine and Gradio to build the webpage.

If you are not familar with Llama.cpp, it is an open-source C++ library that enables efficient inference of large language models (particularly the LLaMA family) on consumer hardware through aggressive quantization, optimized tensor operations, and careful memory management.

Gradio is an open source python library for building performant AI web applications.

Enough talk, let’s code!

Installation

Both Gradio and Llama.cpp’s python bindings can be installed with pip.

Installation Instructions ▶

This will install LLama.cpp with only CPU support. If you have a GPU with CUDA 12.1 installed, you can enable GPU support by replacing the --extra-index-url with https://abetlen.github.io/llama-cpp-python/whl/cu121. There are many supported backends for Llama.cpp, so I recommend you consult the github page to find the best version for your machine.

Downloading The Model

Llama.cpp runs models stored in a special file type called GGUF binary format. You can find LLMs that are stored in GGUF files from the huggingface hub but for this demo we’ll be using the QWEN-2 0.5B instruct model since it is relatively small but performant.

You can download the gguf model with the huggingface cli (it comes installed with Gradio)

Model Download ▶

In this command, we’re specifically downloading the qwen2-0_5b-instruct-q5_k_m.gguf file but you’ll notice that there are many files available in the model repository. Each one corresponds to a different quantization strategy used to compress the original model into the GGUF format. You can consult this table for an explanation of the different formats.

Creating the Chatbot Web App

We’re nearly done. Let’s create an app.py file and import gradio as well as the Llama class from llama.cpp:

Imports ▶

Now we need to implement our gradio prediction function. At a high level, it will take the chat history from the web application, convert it into a format that Llama.cpp expects and then pass that to the chat_completion API. We’ll then yield each new token to stream the response back to the web app.

I’m skipping the low level details here, I recommend you read llama.cpp’s documentation as well as Gradio’s chatbot guide.

Gradio UI ▶

Conclusion

Believe it or not, that’s it!

You can play with our chatbot below and see all the source code (as well as the demo) in this Hugging Face space.