The plan
So far OpenAI seems to be in a great position, having had an incredible response to their ChatGPT release. Google now has Bard, Meta has had its Llama models leaked, and open-source models are being released by organizations such as Stability.AI and Open-Assistant. Everyone and their grandma is trying to conquer the world with their supermassive token predictors. Why not you too? Don’t you want to be the proud owner of your own ChatDIY and impress people at parties? Say no more, just follow the instructions below.
In the next sections I’ll show you how to:
- Wrap an open model into a service and expose it through an API,
- And finally add it to some chat interface
Picking your model
There is no shortage of new LLMs being released these days. Those aspiring “foundation models” are supposed to be hard and costly to train, and yet we see new ones coming out every couple of days. The AI arms race is on, and the future will be either exciting or dystopian.
Billions of parameters are the new normal, even boring, these days. Let’s remain reasonable and pick a small LLM for the sake of the experiment. Having experimented a lot with Stability.AI’s Stable Diffusion, I was quite curious to play with their own LLM, StableLM. And I was pleased to see that the great team behind Open-Assistant had released a fine-tuned version of StableLM.
7 billion parameters should be enough, right? Let’s start “small” and see.
Exposing your model with SimpleAI
Prerequisites
Nothing really surprising there, but you’ll need:
- A relatively recent and powerful GPU
- Python 3.9 or later
I provide a Docker image later, which is an easy way to make sure everything is reproducible, so I’d recommend installing Docker if you haven’t already. Otherwise, using a venv (see here) is a good practice if you don’t want to use containers.
I might be taking the last “S” in the KISS principle a bit too much to heart, and there are probably smarter ways to do it, but I’m simply adding the massive model to the image, so you might want to increase the maximum container size in your Docker configuration (the default is 20GB).
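How you do this depends on your storage driver; with the (legacy) devicemapper driver, for instance, you can raise the base device size in /etc/docker/daemon.json along these lines (an illustration, adapt it to your own setup):

```json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.basesize=50G"
  ]
}
```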
We will use several Python packages, including the latest version of SimpleAI:
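```bash
# Package names are my best guess; check each project's install docs if something differs.
pip install --upgrade simple_ai_server transformers accelerate torch
```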
Creating a model service
So how can we use these Open-Assistant models? I’ve actually found it a bit tricky to find precise information about it. This is a super interesting initiative involving a lot of talented people, and they’ve put an astonishing amount of work into training and releasing such models, yet there isn’t a clear path to using them. But it’s improving every day!
If you refer to their fine-tuned Pythia 12B model page, you’ll get some instructions for prompting:
Two special tokens are used to mark the beginning of user and assistant turns: `<|prompter|>` and `<|assistant|>`. Each turn ends with a `<|endoftext|>` token.
And even an example prompt, which looks roughly like this (the question itself is only an illustration):
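```
<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>
```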
It’s interesting to note that the roles here slightly differ from OpenAI’s API:
- `system` isn’t an option and shall become `assistant` (?)
- `assistant` remains the same
- `user` is here called a “prompter”
Great, now we understand better what we want to achieve. Let’s start from there and use SimpleAI to expose a gRPC-based service (for a quick introduction about this project I’ve built, you can refer to my previous post).
First we need to “translate” from SimpleAI / OpenAI roles to Open-Assistant ones in model.py:
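Something along these lines should do (a sketch; mapping `system` to the assistant token is my reading of the note above):

```python
def preprocess_roles(role: str) -> str:
    # Map OpenAI-style roles to Open-Assistant special tokens.
    mapping = {
        "system": "<|assistant|>",   # no system role in Open-Assistant prompting
        "user": "<|prompter|>",
        "assistant": "<|assistant|>",
    }
    return mapping.get(role, "<|prompter|>")
```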
We also have to format the received messages, from a list of dicts in the shape of `{"role": "given role", "content": "message content"}` to an input prompt (a `str`). To do so we will write our own helper function:
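```python
def format_chat_log(chat: list[dict]) -> str:
    # Each turn becomes <role token> + content + <|endoftext|>, and we finish
    # with <|assistant|> to cue the model that it is its turn to answer.
    raw_chat_text = ""
    for item in chat:
        raw_chat_text += f"{preprocess_roles(item['role'])}{item['content']}<|endoftext|>"
    return raw_chat_text + "<|assistant|>"
```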
And we’re done! OK no, I’m joking here, but trust me, we’ve already done the most complicated part. The rest is mostly boilerplate code where we:
- Define an `OpenAssistantModel` class
- Implement the `stream` method for chat stream responses (`stream=true` in OpenAI’s API)
- Optional: implement a `chat` method for non-stream responses (`stream=false` in OpenAI’s API)
We don’t really need the chat method as we will later only rely on streaming responses, but it’s a good exercise to highlight the difference between the two.
So let’s get back to our model.py. First we need to import a few things at the top:
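```python
# Imports used in the snippets below (plus whatever base class SimpleAI's gRPC
# tooling expects for model servicers, which I leave out here).
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
```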
Then let’s define somewhere the ID of the model we’re using:
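```python
# My assumption for the checkpoint: the Open-Assistant fine-tune of StableLM 7B.
MODEL_ID = "OpenAssistant/stablelm-7b-sft-v7-epoch-3"
```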
Then come our previously defined utilities preprocess_roles and format_chat_log, and we can create our class:
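Something along these lines (a sketch; tune dtype and device to your hardware):

```python
class OpenAssistantModel:
    # Load tokenizer and model once, at start-up, in half precision so that a
    # 7B model fits on a single consumer GPU.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")
```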
We now have to implement our methods. For `chat`, you have to:
- Format your chat log,
- Tokenize the inputs,
- Pass this prompt to the model to generate a prediction,
- Decode the output and do a bit of post-processing,
- Return the result as a list of messages `[{"role": role, "content": output}]`
Which is what is done below:
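```python
    # A sketch: the signature and generation parameters are assumptions, adapt
    # them to what your SimpleAI version expects.
    def chat(self, chatlog: list[dict] = None, max_tokens: int = 512,
             temperature: float = 0.9, *args, **kwargs) -> list[dict]:
        # 1. Format the chat log into an Open-Assistant style prompt
        prompt = format_chat_log(chatlog)
        # 2. Tokenize the inputs
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # 3. Generate a prediction
        output_ids = self.model.generate(
            **inputs, max_new_tokens=max_tokens, temperature=temperature, do_sample=True
        )
        # 4. Decode only the newly generated tokens and clean them up
        output = self.tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
        )
        # 5. Return the result as a list of messages
        return [{"role": "assistant", "content": output.strip()}]
```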
For `stream`, we follow roughly the same process; the only difference is that we will yield chunks as they come, leveraging `TextIteratorStreamer` and the `threading` library:
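```python
    def stream(self, chatlog: list[dict] = None, max_tokens: int = 512,
               temperature: float = 0.9, *args, **kwargs):
        prompt = format_chat_log(chatlog)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # TextIteratorStreamer collects decoded text chunks as they are generated...
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        generation_kwargs = dict(
            **inputs, streamer=streamer, max_new_tokens=max_tokens,
            temperature=temperature, do_sample=True,
        )
        # ...while generation runs in a background thread, so we can yield here.
        Thread(target=self.model.generate, kwargs=generation_kwargs).start()
        for chunk in streamer:
            # The exact shape of the yielded chunks is an assumption; check what
            # SimpleAI's streaming servicer expects.
            yield [{"role": "assistant", "content": chunk}]
```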
Now we simply have to start a gRPC server using this model. This can be done quite easily in another file, server.py, thanks to SimpleAI tools.
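A minimal version could look like this; the import path and `serve` helper below are my best guess at SimpleAI’s gRPC interface, so double-check them against the version you installed:

```python
# server.py
from simple_ai.api.grpc.chat.server import LanguageModelServicer, serve

from model import OpenAssistantModel

if __name__ == "__main__":
    serve(
        address="[::]:50051",
        model_servicer=LanguageModelServicer(model=OpenAssistantModel()),
    )
```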
And you should be able to just start your server now:
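```bash
python3 server.py
```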
You can find the full example here, which also includes the aforementioned Dockerfile, and a bit of refactoring to download the model and tokenizer beforehand with get_models.py (you surely don’t want to download dozens of GB each time you start your container).
Adding your model to your SimpleAI server
You now need to declare your model in a models.toml file to make it available in your SimpleAI server:
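Something along these lines (the exact fields are my take on SimpleAI’s example configuration, and the gRPC address must match the one used in server.py):

```toml
[open-assistant-stablelm]
    [open-assistant-stablelm.metadata]
        owned_by    = "Open-Assistant"
        permission  = []
        description = "Open-Assistant fine-tuned StableLM 7B"
    [open-assistant-stablelm.network]
        type = "gRPC"
        url  = "localhost:50051"
```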
Now if you run simple_ai serve and go to /docs, you should be able to see your model and even try it out. Congratulations!
Adding a chat interface to your model
If everything went correctly, you should now be able to query your model, either using the Swagger UI, a curl command, or a client such as the OpenAI client. But no one (besides maybe geeks like us) really wants to interact with an AI using curl commands, right? A key component of the viral success of ChatGPT is its super nice and easy-to-use interface. After all, fine-tuning GPT-3 through RLHF is mostly about giving you access to the same capabilities as GPT-3, but in a more human-friendly manner.
UI matters. And the only one we have right now sucks. Let’s fix that!
Picking a UI
We’ll use Chatbot UI here, as it:
- Looks clean and nice (and even includes a dark theme),
- Is under active development and is popular,
- Includes a Docker image with clear instructions on using custom environment variables, as we’ll want to override some (👋 `OPENAI_API_HOST`),
- Uses an MIT license.
It genuinely looked good enough not to warrant reinventing the wheel. Pick your battles: implementing a chat interface isn’t rocket science if you know a few things about front-end development, but you don’t have to do it at this stage.
Issues
OK, so the way it’s advertised, we should be able to simply run:
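```bash
# Image name/tag comes from the Chatbot UI README (double-check it there).
# OPENAI_API_HOST must point at wherever your SimpleAI server is reachable from
# inside the container; on Linux you may need --add-host=host.docker.internal:host-gateway.
docker run -p 3000:3000 \
  -e OPENAI_API_KEY="anything-non-empty" \
  -e OPENAI_API_HOST="http://host.docker.internal:8080" \
  ghcr.io/mckaywrigley/chatbot-ui:main
```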
And it would work, right? Well, no. If only things were that easy with code. A few issues we have to fix here:
- So far the UI uses an `enum` for the models, which means we cannot pass an arbitrary model name:
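```typescript
// Paraphrased from Chatbot UI's model types; the exact file and values may differ.
export enum OpenAIModelID {
  GPT_3_5 = 'gpt-3.5-turbo',
  GPT_4 = 'gpt-4',
}
```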
- The default value for `OPENAI_API_HOST` is set to `https://api.openai.com`, while the OpenAI client uses `https://api.openai.com/v1`. It’s a tiny difference, but that `/v1` will make or break the API calls, and to be consistent with the official OpenAI client, SimpleAI also doesn’t include `/v1` in its default endpoints.
Making it work
- To address the `enum` thing, simply rename the model in `models.toml` to one of the allowed ones:
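```toml
# Same declaration as before, only the model name changes.
["gpt-3.5-turbo"]
    ["gpt-3.5-turbo".metadata]
        owned_by    = "Open-Assistant"
        permission  = []
        description = "Open-Assistant fine-tuned StableLM 7B"
    ["gpt-3.5-turbo".network]
        type = "gRPC"
        url  = "localhost:50051"
```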
Note: pay attention to the `"` surrounding the model name, as the `.` in it would create issues otherwise.
- To add that infamous `/v1` prefix, you have several options, including using an `nginx` reverse proxy. A very simple workaround is to not use `simple_ai serve`, but a custom script `sai_server.py` instead, with something along these lines:
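```python
# sai_server.py: a sketch. It assumes SimpleAI exposes its FastAPI app as
# simple_ai.server:app; adjust the import if the package layout differs.
import uvicorn
from fastapi import FastAPI

from simple_ai.server import app as simple_ai_app

# Mount the SimpleAI app under /v1 so routes match the official OpenAI client.
app = FastAPI()
app.mount("/v1", simple_ai_app)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
```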
Then simply run `python3 sai_server.py` instead. Give it a try with a `curl` command:
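```bash
# Adjust host and port to wherever your server listens; the model name must
# match what you declared in models.toml.
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello there!"}]}'
```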
Now, run the previous `docker run` command again, go to http://localhost:3000/en, and you should be able to start a conversation with your new AI best friend.
Just don’t fall in love with it.
Going further
All this is a cool little project to tinker around with, but you will have a hard time competing with Big Tech with it. The model we picked isn’t really able to compete with ChatGPT and generations are pretty slow, but at least you might have gained some understanding of how things work behind the scenes.
You can now try to build upon this and perhaps:
- Make the whole thing more streamlined by adding a `kubernetes` or `docker-compose` layer,
- Build your own UI, or fork the suggested one, to support custom model names,
- Add a second model in parallel to be able to compare the two in an interactive way,
- Add some caching or authentication mechanism
Perhaps you’ll wait for true contenders with more capabilities than the one we’ve used here. Or perhaps a 7B- or 12B-parameter model is already enough for your needs. I hope you’ve enjoyed the journey, let me know your thoughts!