The plan
So far OpenAI seems to be in a great position, having had an incredible response to their ChatGPT release. Google now has Bard, Meta has had their Llama models leaked, and open source models are being released by orgs such as Stability.AI and Open-Assistant. Everyone and their grandma is trying to conquer the world with their supermassive token predictors. Why not you too? Don’t you want to be the proud owner of your own ChatDIY and impress people at parties? Say no more, follow the instructions below.
In the next sections I’ll show you how to:
- Wrap an open model into a service and expose it through an API,
- And finally add it to some chat interface
Picking your model
There is no shortage of new LLMs being released these days. Those aspiring “foundation models” are supposed to be hard and costly to train, and yet we see new ones coming out every couple of days. The AI arms race is on, and the future will be either exciting or dystopian.
Billions of parameters is the new normal and even boring these days. Let’s remain reasonable and pick a small LLM for the sake of the experiment. Having experimented a lot with Stability.AI’s Stable Diffusion, I was quite curious to play with their own LLM, StableLM. And I was pleased to see that the great team behind Open-Assistant had released a fine-tuned version of StableLM.
7 billion parameters should be enough, right? Let’s find out and start “small”.
Exposing your model with SimpleAI
Prerequisites
Nothing really surprising there, but you’ll need:
- A relatively recent and powerful GPU
- Python 3.9 or later
I provide a Dockerfile later, which is an easy way to make sure everything is reproducible, so I’d recommend installing Docker if you haven’t already. Otherwise, using a venv (see here) is good practice if you don’t want to use containers.
I might be taking the last “S” in the KISS principle a bit too much to heart, and there are probably smarter ways to do it, but I’m simply adding the massive model to the image, so you might want to increase the maximum container size in your Docker configuration (default is 20GB):
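How to do that depends on your storage driver and Docker setup; with the devicemapper driver, for instance, you could bump the base size in /etc/docker/daemon.json (the 50GB value here is just an illustration):

```json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.basesize=50G"
  ]
}
```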
We will use several Python packages, including the latest version of SimpleAI:
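Something along these lines should cover it (the exact package list depends on your setup; the SimpleAI package is published on PyPI as simple_ai_server, at least at the time of writing):

```bash
pip install --upgrade simple_ai_server torch transformers accelerate
```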
Creating a model service
So how can we use these Open-Assistant models? I’ve actually found it a bit tricky to find precise information about it. This is a super interesting initiative involving a lot of talented people, and they’ve put an astonishing amount of work into training and releasing such models, yet there isn’t a clear path to using them. But it’s improving every day!
If you refer to their fine-tuned Pythia 12B model page, you’ll get some instructions for prompting:
Two special tokens are used to mark the beginning of user and assistant turns: <|prompter|> and <|assistant|>. Each turn ends with a <|endoftext|> token.
And even an example prompt:
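It goes something like this (paraphrasing the model card), a single user turn followed by the assistant token so the model knows it should answer:

```
<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>
```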
It’s interesting to note that the roles here slightly differ from OpenAI’s API:
- system isn’t an option and shall become assistant (?)
- assistant remains the same
- user is here called a “prompter”
Great, now we understand better what we want to achieve. Let’s start from there and use SimpleAI to expose a gRPC-based service (for a quick introduction about this project I’ve built, you can refer to my previous post).
First we need to “translate” from SimpleAI / OpenAI roles to Open-Assistant ones in model.py:
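Here is a sketch of what that could look like (the system-to-assistant choice being the questionable bit flagged above):

```python
def preprocess_roles(role: str) -> str:
    """Map OpenAI-style roles to the ones Open-Assistant expects."""
    mapping = {
        "user": "prompter",     # "user" becomes "prompter"
        "system": "assistant",  # no "system" role, so fold it into "assistant"
        "assistant": "assistant",
    }
    return mapping.get(role, role)
```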
We also have to format the received messages, a list of dict in the shape of {"role": "given role", "content": "message content"}, into an input prompt (a str). To do so we will write our custom function:
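A minimal version could look like this, ending the prompt with the assistant token so the model knows it’s its turn to speak:

```python
def format_chat_log(chat: list[dict]) -> str:
    """Turn a list of {"role": ..., "content": ...} dicts into a single prompt string."""
    prompt = ""
    for message in chat:
        role = preprocess_roles(message["role"])
        prompt += f"<|{role}|>{message['content']}<|endoftext|>"
    # Finish with the assistant token so the model generates the answer
    return prompt + "<|assistant|>"
```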
And we’re done! OK no, I’m joking, but trust me, we’ve almost finished the most complicated part. The rest is sort of boilerplate code where we:
- Define an OpenAssistantModel class
- Implement the stream method for chat streaming responses (stream=true in OpenAI’s API)
- Optional: implement a chat method for non-streaming responses (stream=false in OpenAI’s API)
We don’t really need the chat method as we will later only rely on streaming responses, but it’s a good exercise to highlight the difference between the two.
So let’s get back to our model.py. First we need to import a few things at the top:
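Something like this should do; note that the SimpleAI import path below is how I remember it from the project’s examples, so double-check it on your side:

```python
from dataclasses import dataclass
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Assumption: SimpleAI exposes a LanguageModel base class for its gRPC chat services
from simple_ai.api.grpc.chat.server import LanguageModel
```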
Then let’s define the id of the model we’re using:
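For instance, assuming we go with the Open-Assistant fine-tune of StableLM 7B from the Hugging Face Hub (swap in whichever checkpoint you picked):

```python
# Hugging Face Hub id -- double-check the exact checkpoint name you want
MODEL_ID = "OpenAssistant/stablelm-7b-sft-v7-epoch-3"
```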
Then come our previously defined utilities preprocess_roles and format_chat_log, and we can create our class:
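The class itself can stay pretty lean, essentially loading the tokenizer and model once (the dataclass decorator and base class follow the pattern from SimpleAI’s examples as I remember them; half precision and a single GPU are my assumptions, adjust to your hardware):

```python
@dataclass(unsafe_hash=True)
class OpenAssistantModel(LanguageModel):
    # Loaded once at start-up; float16 to fit the 7B model on a single recent GPU
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")
```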
We now have to implement our methods:
- For chat you have to:
  - Format your chatlog,
  - Tokenize the inputs,
  - Pass this prompt to the model to generate a prediction,
  - Decode the output and do a bit of post-processing,
  - Return the result as a list of messages [{"role": role, "content": output}]
Which is what is done below:
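A sketch of that chat method (the exact arguments SimpleAI passes, such as chatlog and max_tokens, are assumptions on my side, so align them with the gRPC interface you get):

```python
    # Inside the OpenAssistantModel class
    def chat(self, chatlog: list = None, max_tokens: int = 512,
             temperature: float = 0.7, *args, **kwargs) -> list:
        # 1. Format the chat log into a single Open-Assistant style prompt
        prompt = format_chat_log(chatlog)
        # 2. Tokenize
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # 3. Generate
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        # 4. Decode only the newly generated tokens, dropping special tokens
        output = self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
        )
        # 5. Return the result as a list of messages
        return [{"role": "assistant", "content": output}]
```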
For stream, we follow roughly the same process; the only difference is that we will yield chunks as they come, leveraging the TextIteratorStreamer class and the threading library:
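Here is a sketch, running generation in a background thread and yielding decoded chunks as the streamer produces them (the exact chunk format SimpleAI expects is again an assumption, mirroring the shape returned by chat above):

```python
    # Inside the OpenAssistantModel class
    def stream(self, chatlog: list = None, max_tokens: int = 512,
               temperature: float = 0.7, *args, **kwargs):
        prompt = format_chat_log(chatlog)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # The streamer yields decoded text chunks as soon as they are generated
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        generation_kwargs = dict(
            **inputs,
            streamer=streamer,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        # Run generation in a separate thread so we can consume the streamer here
        Thread(target=self.model.generate, kwargs=generation_kwargs).start()
        # First chunk carries the role, the following ones only carry content deltas
        yield [{"role": "assistant", "content": ""}]
        for chunk in streamer:
            yield [{"role": "", "content": chunk}]
```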
Now we simply have to start a gRPC server using this model. This can be done in another file, server.py, quite easily thanks to SimpleAI tools:
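Roughly like so; the serve helper and its arguments below are how I remember SimpleAI’s examples, so verify them against the project before copying this:

```python
import logging

# Assumption: SimpleAI ships a serve helper next to the LanguageModel base class
from simple_ai.api.grpc.chat.server import serve

from model import OpenAssistantModel

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Expose the model as a gRPC service on port 50051
    serve(address="[::]:50051", model_servicer=OpenAssistantModel())
```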
And you should be able to just start your server now:
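Assuming you kept the file name above:

```bash
python3 server.py
```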
You can find the full example here, which also includes the aforementioned Dockerfile, and a bit of refactoring to download the model and tokenizer beforehand with get_models.py (you surely don’t want to download dozens of GB each time you start your container).
Adding your model to your SimpleAI server
You now need to declare your model in a models.toml file to make it available in your SimpleAI server:
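The entry roughly follows the format from the SimpleAI README, a table named after the model with some metadata and the address of the gRPC service (the exact field names below are from memory, so double-check them):

```toml
[open-assistant]
    [open-assistant.metadata]
        owned_by    = "Open-Assistant"
        permission  = []
        description = "Open-Assistant fine-tuned StableLM 7B"
    [open-assistant.network]
        type = "gRPC"
        url  = "localhost:50051"
```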
Now if you run simple_ai serve and go to /docs, you should be able to see your model and even try it out. Congratulations!
Adding a chat interface to your model
If everything went correctly, you should now be able to query your model, either using the Swagger UI, a curl command, or a client such as the OpenAI one. But no one (besides maybe geeks like us) really wants to interact with an AI using curl commands, right? A key component of the viral success of ChatGPT is its super nice and easy to use interface. After all, fine-tuning GPT-3 through RLHF is mostly about giving you access to the same capabilities as GPT-3, but in a more human-friendly manner.
UI matters. And the only one we have right now sucks. Let’s fix that!
Picking a UI
We’ll use Chatbot UI here, as it:
- Looks clean and nice (and even includes a dark theme)
- Is under active development and popular
- Ships a Docker image with clear instructions for using custom environment variables, as we’ll want to override some (👋 OPENAI_API_HOST)
- Uses an MIT license
It genuinely looked good enough to not need to reinvent the wheel. Pick your battles, implementing a chat interface isn’t rocket science if you know a few things about front end development, but you don’t have to at this stage.
Issues
Ok so the way it’s advertised, we should be able to simply run:
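Namely something like this, pointing the UI at our SimpleAI server instead of OpenAI (the image name comes from the Chatbot UI README; the host and port are assumptions about where your server is listening):

```bash
# host.docker.internal lets the container reach a server running on the host;
# adjust it (or use --network host on Linux) to match your setup
docker run \
  -e OPENAI_API_KEY="not-really-used-here" \
  -e OPENAI_API_HOST="http://host.docker.internal:8080" \
  -p 3000:3000 \
  ghcr.io/mckaywrigley/chatbot-ui:main
```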
And it would work, right? Well, no. If only things were that easy with code. A few issues we have to fix here:
- So far the UI uses an enum for the models, which means we cannot pass an arbitrary model name:
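The relevant bit in the Chatbot UI code looks roughly like this (paraphrased from its types/openai.ts, which may have changed since):

```typescript
export enum OpenAIModelID {
  GPT_3_5 = 'gpt-3.5-turbo',
  GPT_4 = 'gpt-4',
}
```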
- The default value for OPENAI_API_HOST is set to https://api.openai.com, while the OpenAI client uses https://api.openai.com/v1. It’s a tiny difference, but that /v1 will make or break the API calls, and to be consistent with the official OpenAI client, SimpleAI also doesn’t include /v1 in its default endpoints.
Making it work
- To address the enum thing, simply rename the model in models.toml to one of the allowed ones:
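For the example entry above, that just means renaming the table to one of the names the UI knows, for instance "gpt-3.5-turbo" (still the same hypothetical field names as before):

```toml
["gpt-3.5-turbo"]
    ["gpt-3.5-turbo".metadata]
        owned_by    = "Open-Assistant"
        permission  = []
        description = "Open-Assistant fine-tuned StableLM 7B"
    ["gpt-3.5-turbo".network]
        type = "gRPC"
        url  = "localhost:50051"
```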
Note: pay attention to the double quotes surrounding the model name, as the . in it would otherwise create issues.
- To add that infamous /v1 prefix, you have several options, including using an nginx reverse proxy. A very simple workaround is to not use simple_ai serve but a custom script sai_server.py and use:
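Something along these lines, mounting SimpleAI’s FastAPI app under a /v1 prefix (the simple_ai.server import is an assumption based on how the simple_ai serve command works, so verify it):

```python
import uvicorn
from fastapi import FastAPI

# Assumption: this is where SimpleAI exposes its FastAPI application
from simple_ai.server import app as simple_ai_app

app = FastAPI()
# Serve every SimpleAI route under the /v1 prefix expected by Chatbot UI
app.mount("/v1", simple_ai_app)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
```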
Then simply run python3 sai_server.py instead. Give it a try with a curl command:
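For instance (port 8080 being whatever you passed to uvicorn above):

```bash
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello there, how are you doing?"}]
  }'
```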
Now, if you run the Chatbot UI container again with the same docker run command as before and go to http://localhost:3000/en, you should be able to start a conversation with your new AI best friend:
Just don’t fall in love with it.
Going further
All this is a cool little project to tinker around with, but you will have a hard time competing with Big Tech with this. The model we picked is not really able to compete with ChatGPT and generation is pretty slow, but at least you might have gained some understanding of how it works behind the scenes.
You can now try to build upon this and perhaps:
- Make the whole thing more streamlined by adding a kubernetes or docker-compose layer,
- Build your own UI, or fork the suggested one, to support custom model names,
- Add a second model in parallel to be able to compare the two in an interactive way,
- Add some caching or authentication mechanism
Perhaps you’ll wait for true contenders with more capabilities than the one we’ve used here. Or perhaps a 7B or 12B parameter model is already enough for your needs. I hope you’ve enjoyed the journey, let me know your thoughts!