The plan
So far OpenAI seems to be in a great position, having had an incredible response to their ChatGPT release. Google now has Bard, Meta has had its Llama models leaked, and open-source models are being released by organizations such as Stability.AI and Open-Assistant. Everyone and their grandma is trying to conquer the world with their supermassive token predictors. Why not you too? Don’t you want to be the proud owner of your own ChatDIY and impress people at parties? Say no more, just follow the instructions below.
In the next sections I’ll show you how to:
- Wrap an open model into a service and expose it through an API,
- And finally add it to some chat interface
Picking your model
There is no shortage of new LLMs being released these days. Those aspiring “foundation models” are supposed to be hard and costly to train, and yet we see new ones coming out every couple of days. The AI arms race is on, and the future will be either exciting or dystopian.
Billions of parameters are the new normal, even boring, these days. Let’s remain reasonable and pick a small LLM for the sake of the experiment. Having experimented a lot with Stability.AI’s Stable Diffusion, I was quite curious to play with their own LLM, StableLM. And I was pleased to see that the great team behind Open-Assistant had released a fine-tuned version of StableLM.
7 billion parameters should be enough, right? Let’s start “small” and see.
Exposing your model with SimpleAI
Prerequisites
Nothing really surprising there, but you’ll need:
- A relatively recent and powerful GPU
- Python 3.9 or later
I provide a Docker image later, which is an easy way to make sure everything is reproducible, so I’d recommend installing Docker if you haven’t already. Otherwise, using a venv (see here) is a good practice if you don’t want to use containers.
I might be taking the last “S” in the KISS principle a bit too much to heart, and there are probably smarter ways to do it, but I’m simply adding the massive model to the image, so you might want to increase the maximum container size in your Docker configuration (the default is 20GB).
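How you do this depends on your storage driver; with the (legacy) devicemapper driver, for instance, you can raise the base device size in /etc/docker/daemon.json along these lines (an illustration, adapt it to your own setup):

```json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.basesize=50G"
  ]
}
```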
We will use several Python packages, including the latest version of SimpleAI:
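```bash
# Package names are my best guess; check each project's install docs if something differs.
pip install --upgrade simple_ai_server transformers accelerate torch
```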
Creating a model service
So how can we use these Open-Assistant models? I’ve actually found it a bit tricky to find precise information about it. This is a super interesting initiative involving a lot of talented people, and they’ve put an astonishing amount of work into training and releasing such models, yet there isn’t a clear path to using them. But it’s improving every day!
If you refer to their fine-tuned Pythia 12B model page, you’ll get some instructions for prompting:
Two special tokens are used to mark the beginning of user and assistant turns: `<|prompter|>` and `<|assistant|>`. Each turn ends with a `<|endoftext|>` token.
And even an example prompt, which looks roughly like this (the question itself is only an illustration):
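```
<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>
```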
It’s interesting to note that the roles here slightly differ from OpenAI’s API:
- `system` isn’t an option and shall become `assistant` (?)
- `assistant` remains the same
- `user` is here called a “prompter”
Great, now we understand better what we want to achieve. Let’s start from there and use SimpleAI to expose a gRPC-based service (for a quick introduction about this project I’ve built, you can refer to my previous post).
First we need to “translate” from SimpleAI / OpenAI roles to Open-Assistant ones in model.py:
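Something along these lines should do (a sketch; mapping `system` to the assistant token is my reading of the note above):

```python
def preprocess_roles(role: str) -> str:
    # Map OpenAI-style roles to Open-Assistant special tokens.
    mapping = {
        "system": "<|assistant|>",   # no system role in Open-Assistant prompting
        "user": "<|prompter|>",
        "assistant": "<|assistant|>",
    }
    return mapping.get(role, "<|prompter|>")
```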
We also have to format the received messages, from a list of dicts in the shape of `{"role": "given role", "content": "message content"}` to an input prompt (a `str`). To do so we will write our own helper function:
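```python
def format_chat_log(chat: list[dict]) -> str:
    # Each turn becomes <role token> + content + <|endoftext|>, and we finish
    # with <|assistant|> to cue the model that it is its turn to answer.
    raw_chat_text = ""
    for item in chat:
        raw_chat_text += f"{preprocess_roles(item['role'])}{item['content']}<|endoftext|>"
    return raw_chat_text + "<|assistant|>"
```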
And we’re done! OK no, I’m joking here, but trust me, we’ve already done the most complicated part. The rest is mostly boilerplate code where we:
- Define an `OpenAssistantModel` class
- Implement the `stream` method for chat stream responses (`stream=true` in OpenAI’s API)
- Optional: implement a `chat` method for non-stream responses (`stream=false` in OpenAI’s API)
We don’t really need the chat method as we will later only rely on streaming responses, but it’s a good exercise to highlight the difference between the two.
So let’s get back to our model.py. First we need to import a few things at the top:
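```python
# Imports used in the snippets below (plus whatever base class SimpleAI's gRPC
# tooling expects for model servicers, which I leave out here).
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
```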
Then let’s define somewhere the ID of the model we’re using:
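```python
# My assumption for the checkpoint: the Open-Assistant fine-tune of StableLM 7B.
MODEL_ID = "OpenAssistant/stablelm-7b-sft-v7-epoch-3"
```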
Then come our previously defined utilities preprocess_roles and format_chat_log, and we can create our class:
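Something along these lines (a sketch; tune dtype and device to your hardware):

```python
class OpenAssistantModel:
    # Load tokenizer and model once, at start-up, in half precision so that a
    # 7B model fits on a single consumer GPU.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")
```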
We now have to implement our methods. For `chat`, you have to:
- Format your chat log,
- Tokenize the inputs,
- Pass this prompt to the model to generate a prediction,
- Decode the output and do a bit of post-processing,
- Return the result as a list of messages `[{"role": role, "content": output}]`
Which is what is done below:
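```python
    # A sketch: the signature and generation parameters are assumptions, adapt
    # them to what your SimpleAI version expects.
    def chat(self, chatlog: list[dict] = None, max_tokens: int = 512,
             temperature: float = 0.9, *args, **kwargs) -> list[dict]:
        # 1. Format the chat log into an Open-Assistant style prompt
        prompt = format_chat_log(chatlog)
        # 2. Tokenize the inputs
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # 3. Generate a prediction
        output_ids = self.model.generate(
            **inputs, max_new_tokens=max_tokens, temperature=temperature, do_sample=True
        )
        # 4. Decode only the newly generated tokens and clean them up
        output = self.tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
        )
        # 5. Return the result as a list of messages
        return [{"role": "assistant", "content": output.strip()}]
```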
For `stream`, we follow roughly the same process; the only difference is that we will yield chunks as they come, leveraging `TextIteratorStreamer` and the `threading` library:
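```python
    def stream(self, chatlog: list[dict] = None, max_tokens: int = 512,
               temperature: float = 0.9, *args, **kwargs):
        prompt = format_chat_log(chatlog)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # TextIteratorStreamer collects decoded text chunks as they are generated...
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        generation_kwargs = dict(
            **inputs, streamer=streamer, max_new_tokens=max_tokens,
            temperature=temperature, do_sample=True,
        )
        # ...while generation runs in a background thread, so we can yield here.
        Thread(target=self.model.generate, kwargs=generation_kwargs).start()
        for chunk in streamer:
            # The exact shape of the yielded chunks is an assumption; check what
            # SimpleAI's streaming servicer expects.
            yield [{"role": "assistant", "content": chunk}]
```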
Now we simply have to start a gRPC server using this model. This can be done quite easily in another file, server.py, thanks to SimpleAI tools.
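A minimal version could look like this; the import path and `serve` helper below are my best guess at SimpleAI’s gRPC interface, so double-check them against the version you installed:

```python
# server.py
from simple_ai.api.grpc.chat.server import LanguageModelServicer, serve

from model import OpenAssistantModel

if __name__ == "__main__":
    serve(
        address="[::]:50051",
        model_servicer=LanguageModelServicer(model=OpenAssistantModel()),
    )
```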
And you should be able to just start your server now:
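```bash
python3 server.py
```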
You can find the full example here, which also includes the aforementioned Dockerfile, and a bit of refactoring to download the model and tokenizer beforehand with get_models.py (you surely don’t want to download dozens of GB each time you start your container).
Adding your model to your SimpleAI server
You now need to declare your model in a models.toml file to make it available in your SimpleAI server:
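Something along these lines (the exact fields are my take on SimpleAI’s example configuration, and the gRPC address must match the one used in server.py):

```toml
[open-assistant-stablelm]
    [open-assistant-stablelm.metadata]
        owned_by    = "Open-Assistant"
        permission  = []
        description = "Open-Assistant fine-tuned StableLM 7B"
    [open-assistant-stablelm.network]
        type = "gRPC"
        url  = "localhost:50051"
```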
Now if you run simple_ai serve and go to /docs, you should be able to see your model and even try it out. Congratulations!
Adding a chat interface to your model
If everything went correctly, you should now be able to query your model, either using the Swagger UI, a curl command, or a client such as the OpenAI client. But no one (besides maybe geeks like us) really wants to interact with an AI using curl commands, right? A key component of the viral success of ChatGPT is its super nice and easy-to-use interface. After all, fine-tuning GPT-3 through RLHF is mostly about giving you access to the same capabilities as GPT-3, but in a more human-friendly manner.
UI matters. And the only one we have right now sucks. Let’s fix that!
Picking a UI
We’ll use Chatbot UI here, as it:
- Looks clean and nice (and even includes a dark theme),
- Is under active development and is popular,
- Includes a Docker image with clear instructions on using custom environment variables, as we’ll want to override some (👋 `OPENAI_API_HOST`),
- Uses an MIT license.
It genuinely looked good enough not to warrant reinventing the wheel. Pick your battles: implementing a chat interface isn’t rocket science if you know a few things about front-end development, but you don’t have to do it at this stage.
Issues
OK, so the way it’s advertised, we should be able to simply run:
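```bash
# Image name/tag comes from the Chatbot UI README (double-check it there).
# OPENAI_API_HOST must point at wherever your SimpleAI server is reachable from
# inside the container; on Linux you may need --add-host=host.docker.internal:host-gateway.
docker run -p 3000:3000 \
  -e OPENAI_API_KEY="anything-non-empty" \
  -e OPENAI_API_HOST="http://host.docker.internal:8080" \
  ghcr.io/mckaywrigley/chatbot-ui:main
```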
And it would work, right? Well, no. If only things were that easy with code. A few issues we have to fix here:
- So far the UI uses an `enum` for the models, which means we cannot pass an arbitrary model name:
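```typescript
// Paraphrased from Chatbot UI's model types; the exact file and values may differ.
export enum OpenAIModelID {
  GPT_3_5 = 'gpt-3.5-turbo',
  GPT_4 = 'gpt-4',
}
```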
- The default value for `OPENAI_API_HOST` is set to `https://api.openai.com`, while the OpenAI client uses `https://api.openai.com/v1`. It’s a tiny difference, but that `/v1` will make or break the API calls, and to be consistent with the official OpenAI client, SimpleAI also doesn’t include `/v1` in its default endpoints.
Making it work
- To address the `enum` thing, simply rename the model in `models.toml` to one of the allowed ones:
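```toml
# Same declaration as before, only the model name changes.
["gpt-3.5-turbo"]
    ["gpt-3.5-turbo".metadata]
        owned_by    = "Open-Assistant"
        permission  = []
        description = "Open-Assistant fine-tuned StableLM 7B"
    ["gpt-3.5-turbo".network]
        type = "gRPC"
        url  = "localhost:50051"
```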
Note: pay attention to the `"` surrounding the model name, as the `.` in it would create issues otherwise.
- To add that infamous `/v1` prefix, you have several options, including using an `nginx` reverse proxy. A very simple workaround is to not use `simple_ai serve`, but a custom script `sai_server.py` instead, with something along these lines:
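```python
# sai_server.py: a sketch. It assumes SimpleAI exposes its FastAPI app as
# simple_ai.server:app; adjust the import if the package layout differs.
import uvicorn
from fastapi import FastAPI

from simple_ai.server import app as simple_ai_app

# Mount the SimpleAI app under /v1 so routes match the official OpenAI client.
app = FastAPI()
app.mount("/v1", simple_ai_app)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
```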
Then simply run `python3 sai_server.py` instead. Give it a try with a `curl` command:
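```bash
# Adjust host and port to wherever your server listens; the model name must
# match what you declared in models.toml.
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello there!"}]}'
```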
Now, run the previous `docker run` command again, go to http://localhost:3000/en, and you should be able to start a conversation with your new AI best friend.
Just don’t fall in love with it.
Going further
All this is a cool little project to tinker around with, but you will have a hard time competing with Big Tech with it. The model we picked isn’t really able to compete with ChatGPT and generations are pretty slow, but at least you might have gained some understanding of how things work behind the scenes.
You can now try to build upon this and perhaps:
- Make the whole thing more streamlined by adding a `kubernetes` or `docker-compose` layer,
- Build your own UI, or fork the suggested one, to support custom model names,
- Add a second model in parallel to be able to compare the two in an interactive way,
- Add some caching or authentication mechanism
Perhaps you’ll wait for true contenders with more capabilities than the one we’ve used here. Or perhaps a 7B- or 12B-parameter model is already enough for your needs. I hope you’ve enjoyed the journey, let me know your thoughts!