As Nancy Sinatra once kind of said,
You keep saying you got something for me
Something you call working but confess
You’ve been a’messin' where you shouldn’t ’ve been a’messin'
And now no one else is getting all your best
Well, these models are made for predicting, and that’s just what they’ll do
Training models is fun; making them impactful is hard. Many (very) talented engineers and researchers I've worked with were excellent at finding and tweaking the right, complex, state-of-the-art deep learning models, yet struggled to make the results of their work useful. It usually comes down to the fact that there are a lot of things to consider, from research papers to DevOps principles, software engineering skills, data cleaning, system design, cloud computing, etc. It is very much expected that most ML people will lean towards one side of the spectrum and pick the more experimental side of things (because it's considered by many as "the fun part"). Add to this that too many organizations are understaffed, and that tight deadlines don't leave enough room for optimization, automated testing and other "nice to have" things (not that I agree with this classification), and you see the problem coming.
Fortunately, we see more and more talks about MLOps, companies hiring dedicated teams, and tools such as Kubeflow or mlflow being adopted and maturing, but we are still very far from every notebook and ML model trained by data scientists, researchers and engineers being seamlessly deployed to production. Below, I introduce some good practices and approaches I've found effective over the years, and try to show a path from a messy notebook experiment to a relatively useful image that can be integrated and deployed, for instance on a Kubernetes-based infrastructure.
Notes:
- The sections below assume your project is in Python 3, as is quite standard in this field.
- I deliberately do not cover some key aspects such as networking, security, system design, hardware optimizations, or specific tools and frameworks for inference.
Introducing the ML pizza
You've probably heard of the "ML lifecycle architecture diagram" (and if not, see for instance here). Basically, it explains that there are several phases and components in every ML project, from ideation and business goals to production. What is often overlooked is that what matters isn't the exact breakdown into different phases, but rather the fact that it's an endless iterative process. Your work isn't over once you reach production, as you will probably encounter things such as new business goals, data drift, changes in data pipelines, etc. Sometimes, you might want to build the whole pipeline to deploy a dummy baseline, and only then start to improve the model itself. Because of that, it is often represented as a loop: once you're in the production stage and monitoring your service, you get new insights, reconsider your business goals, return to stage one and start over.
So, in order to contribute to the debate and try to become a thought leader myself, let me introduce to you the concept of the ML Pizza.
It's basically the same thing, but tastier than a bland, basic circle. Also, an ML project is like a pizza:
- It’s good so once it’s over you will probably have another one and start over soon
- A ML product is usually an iterative process, you’re not done forever once it reaches production stage
- When it’s too large, you probably want to share it with others
- Maybe you give the “deploy” slice to your colleagues, perhaps you’ll have to eat it all alone
- You don’t always eat the pizza sequentially, sometimes you’ll start with a slice there, then take another one a bit further
- Maybe you’ll eat the “train and tune model” slice (because you have some parquet files to test some approaches on a static dataset) before the “data pipeline” one (because you might be waiting for someone to grant you access to the database)
- What matters is to empty the plate, whatever the order. Perhaps you can eat the crust at the end, leave out a few toppings in a first round, eat them later
- You might want to build a baseline before going deeper
- You might prefer to keep some optimizations and minor features for later, in order to cover the essential bits first
That’s it. It ain’t much but not much worse than concepts like “data lake”. Do whatever you want with it. In the next sections, I will introduce some tips and strategies to not only eat the “develop model” slice on which most engineers tend to focus, but to also easily eat the nearby ones. Either alone or with others, the idea is to eat them all!
Do not always start with a Jupyter notebook
Build pipelines first, model later
I’ve personally found over the years that:
- Training a model is usually the fun and easy part
- And the first iteration is rarely the right one
As a consequence, I like to build the backbone first, meaning the data and deployment pipelines and some wrappers for training, evaluating and tuning hyperparameters, before setting up a baseline, and only then trying to find more performance by playing with different models and learning strategies. It might take some time at first, but if your goal is to build an ML product that should last, it really helps and pays dividends in the long term.
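To make this concrete, here is a rough sketch of what such a backbone can start as: a couple of thin, typed contracts that any future model (including a dummy baseline) can plug into. Every name below is a placeholder, not a prescription.

```python
# Minimal sketch of a project "backbone": stable contracts for datasets, models and
# evaluation, written before any serious modelling work. All names are placeholders.
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class Dataset:
    features: list[Any]
    labels: list[Any]


class Model(Protocol):
    def fit(self, dataset: Dataset) -> None: ...
    def predict(self, features: list[Any]) -> list[Any]: ...


def evaluate(model: Model, dataset: Dataset) -> dict[str, float]:
    """One single place where metrics are computed, for every model you will ever try."""
    predictions = model.predict(dataset.features)
    accuracy = sum(p == y for p, y in zip(predictions, dataset.labels)) / len(dataset.labels)
    return {"accuracy": accuracy}


class MajorityBaseline:
    """Dumb baseline: always predict the most frequent label seen during training."""

    def fit(self, dataset: Dataset) -> None:
        self.most_common = max(set(dataset.labels), key=dataset.labels.count)

    def predict(self, features: list[Any]) -> list[Any]:
        return [self.most_common for _ in features]
```

Once the data and deployment pipelines are wired around these contracts, replacing `MajorityBaseline` with something smarter is a local change rather than a new project.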
On the other hand, if you don't make this initial investment, you might end up fighting fires every two weeks and spending four days on each new deployment, and we can all agree that you could make better use of your time. By starting too soon with a notebook (I don't mean a basic proof of concept or a mandatory EDA stage), you sometimes focus on the easiest challenge first.
Focus on results
Models are just a way to achieve a goal. By jumping too quickly to the modelling part, we sometimes tend to forget the end goal, and fail to correctly assess the impact of our solution. For a model, what matters is how it performs in real-world conditions: different data distributions (over time), economic considerations (does it run on a CPU or on dozens of GPUs?), latency, throughput, etc. For a business, it is quite unlikely that something like the F1-score is the best way to measure impact. Do not go too far down the rabbit hole of marginal optimizations too soon.
Ok but what if you already have one?
Sometimes the tricky part is getting access to the right dataset or annotating it: you are quite confident in your ability to reach some performance level once you have it (imagine a binary classification task like puppy vs. truck). But sometimes things are more experimental: you have your dataset, but no idea whether you can train a good model on it.
In that case, you don't start with the plumbing; you usually try different approaches, often in a notebook. Once you're happy with your early results, you can go sell your project and get the green light from your management. Your PoC is here, so what's next?
Make it work, step by step
Jupyter notebooks can be great, but they are usually used in terrible ways. One of the major problems I've encountered is the lack of reproducibility. Raise your hand if you have ever seen something similar to this scenario:
- You find a super promising notebook, almost curing cancer and solving AGI
- First cell isn’t executed
- Second one shows `In [7]`, why not
- Getting worse and worse, the following ones have your lottery numbers: 32, 45, 12, 11, 3, 4
- Might be your lucky day, you decide to run the notebook top to bottom:
  - Second cell screams at you that you are missing a library. You decide to fix this using `!pip install random-missing-package`.
  - It fails to install because it cannot compile some obscure C extension.
  - After one hour of furiously googling errors, it's finally installed and you can move on to the next cell.
  - When you try to execute it, it fails because the function `do_something_ridiculous` is only defined in the penultimate cell
- You now consider quitting your job, throwing away every electronic device and moving to Peru to raise alpacas
There is a lot to unpack here, and we will try to address the main issues below, but at the very least, make sure that you can successfully run your notebook from top to bottom without any major issue.
Don’t be (only) a caveman
Many people in ML and data are tinkerers at heart and come from quantitative backgrounds, where code quality isn't usually the first and main concern. As a result, a lot of us just like to glue stuff together, try different things until it works as expected, and then forget about the rest (because we would rather spend our time reading papers and training the next ChatGPT). Yet there are a lot of benefits in going further than that, from making your work more maintainable to enabling collaboration and avoiding future technical debt. Yes, you might be able to ship something a bit dirty but functional and call it a day, but chances are it will come back to bite you at the worst possible time and will require much more work to fix or integrate later on.
Refactor and enhance your code
You’ve spent days and days trying the craziest ideas, and you finally have something that works well. Congrats! Now time to clean your mess, innit?
- If it doesn't spark joy, remove it! I assume at this point that you are using a modern version control system such as Git and have pushed your changes regularly, so perhaps you do not need these hundreds of commented lines, or the `dwnloard_data_from_drive__tmp` function you wrote the first day and don't plan to use anymore.
- Use good practices and patterns: don't forget to fix that `DATAPATH = /home/user/data/traim.csv` you've copy-pasted everywhere for convenience, make it a parameter, etc. (see the sketch after this list)
- Remove obvious performance bottlenecks (e.g. switching to multithreading or multiprocessing when relevant)
- Use modern features of Python: types, decorators, dataclasses… while avoiding overengineering at this stage
- Make it beautiful. Perhaps a tad subjective, but readable and well organized code goes a long way, and is easier to maintain and to contribute to. You can use tools such as pylint, which can even be integrated into your favorite IDE, so no excuse here
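As an example of the `DATAPATH` point above, a hard-coded path can become part of an explicit, typed configuration object. This is only a sketch; names and defaults are illustrative.

```python
# Sketch: gather scattered constants into one typed, explicit configuration object.
# Field names and defaults are illustrative.
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainingConfig:
    data_path: Path
    output_dir: Path
    learning_rate: float = 1e-3
    batch_size: int = 32


def load_training_data(config: TrainingConfig) -> list[str]:
    """Read the raw training file; callers decide where the data actually lives."""
    return config.data_path.read_text().splitlines()


# Example usage (paths are placeholders):
# config = TrainingConfig(data_path=Path("data/train.csv"), output_dir=Path("outputs"))
# rows = load_training_data(config)
```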
I am not the biggest fan of notebooks, as they can be (and usually are) misused, but they have their use cases. They especially shine for demos, and as a basic UI for exploration and experimentation. A pattern I've found quite effective is to use the notebook as a "front end" to your code: define classes and functions in a proper Python project, and only import what is needed into a nicely commented Jupyter notebook.
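Under this pattern, a notebook cell contains little more than imports and a few calls; the module and function names below are hypothetical.

```python
# A typical notebook cell under the "notebook as front end" pattern: no logic is
# defined here, everything is imported from the package. Names are hypothetical.
from my_project.data import load_dataset
from my_project.training import train_model, plot_learning_curves

dataset = load_dataset("data/train.csv")
model, history = train_model(dataset, learning_rate=1e-3)
plot_learning_curves(history)
```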
Wrap it up
Ok, so at this point you should have something you're relatively proud of. The code looks leaner and cleaner, you've addressed the main issues, even added a `README.md` and pushed your changes. Congrats. Now what's next? You need some interface to expose your model and access some `.predict()` method, and therefore probably have to rely on either:
- A REST API endpoint, probably using a `GET` method, that you can implement with a framework such as starlette, FastAPI, or starlite (see the sketch after this list),
- Something based on gRPC,
- Or something rather simple using the command line, such as `python3 predict.py --input /path/to/file --output /path/to/predictions/directory`, that you can write with `argparse` from the standard library, or something like Fire
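As an illustration of the first option, here is a minimal sketch using FastAPI. I use a `POST` route here because predictions usually take a payload; the dummy model, field names and route are all placeholders to adapt to your own project.

```python
# Minimal sketch of a REST interface around a model's .predict() method.
# DummyModel stands in for your real model wrapper; request/response fields are examples.
from fastapi import FastAPI
from pydantic import BaseModel


class DummyModel:
    """Stand-in for your real model; replace with your own loading and inference code."""

    def predict(self, features: list[float]) -> tuple[str, float]:
        return ("positive", 0.5)


app = FastAPI()
model = DummyModel()  # in practice, load your trained artifact at startup


class PredictionRequest(BaseModel):
    features: list[float]


class PredictionResponse(BaseModel):
    label: str
    score: float


@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    label, score = model.predict(request.features)
    return PredictionResponse(label=label, score=score)
```

You would then serve it with something like `uvicorn main:app` (assuming the file is named `main.py`) and send a JSON body to `/predict`.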
At this point you should definitely engage with other people to understand the best way to proceed and the relevant use cases (e.g. do you need batch prediction?). Write your interface; you might also need to add a few new classes to handle this. And keep in mind that even if we are focusing here on the inference / deployment part of an ML project, this is also useful and sometimes necessary for training (e.g. hyperparameter search, done manually or using tools like SageMaker or Kubeflow's katib module).
Make it reproducible
"Code is good, it works on my machine" is one of the greatest fallacies in our industry. At this point, you probably have a `requirements.txt` somewhere (hopefully not a bunch of `!pip install tensorflow` here and there in your notebook). Avoid any potential issues by specifying the versions (package names and versions below are just illustrative), e.g. not:
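```
tensorflow
pandas
scikit-learn
```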
But rather something like:
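```
# versions are only illustrative; pin the ones you actually use
tensorflow==2.11.0
pandas==1.5.3
scikit-learn==1.2.2
```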
From there you should be able to successfully build a Docker image using a `Dockerfile`, for instance something minimal like this (a sketch: base image, paths and entry point are illustrative and assume a CPU-only project):
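```dockerfile
# Minimal sketch of a Dockerfile for a CPU-only Python project; adjust to your needs.
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first to benefit from Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project
COPY . .

# Replace with your actual entry point (API server, CLI, ...)
CMD ["python3", "predict.py", "--help"]
```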
Then start a container and run your code. Basically, the idea here is to ensure that you can build your project from scratch and run it without hitting issues such as missing dependencies, broken paths, missing permissions and all the other usual suspects.
Training vs inference
In most situations, there is a clear separation between training a model and using it at inference time. That means you could probably have different images, to minimize things like the number of dependencies or the image size:
- You probably do not need to add every file to your inference service (👋 gigabytes of training data, model checkpoints, useless packages)
- Likewise, if you expose your model through some REST or gRPC API, you probably don't need the code and related dependencies for training
- Hardware requirements might also be different (do you need that base image with CUDA and all the GPU-related things if a CPU is enough at inference?)
I personally like to use a different `Dockerfile` and `requirements.txt` per phase for relatively complex projects, for instance with a layout like the one sketched below, even if that is not always necessary and can be some sort of premature optimization. Use your common sense here: if there is only one lightweight library specific to a phase, it might not be worth the hassle.
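One possible layout, with file names that are just a convention I happen to like (nothing standard about them):

```
project/
├── Dockerfile.train            # heavier image: GPU base, training-only dependencies
├── Dockerfile.inference        # slim image: only what the serving code needs
├── requirements-train.txt
└── requirements-inference.txt
```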
Logging, testing, damage mitigation
Because it’s never as seamless as expected.
Test stuff
Test your project, but keep things separated: testing the code is not the same as testing the model. Verifying that a function accepts integers and checking a model's confusion matrix are different exercises, and therefore shouldn't be tested at the same time, nor in the same place.
Code
Test your code. Yes, not very original nor very exciting, but adding automated tests to your project can be very valuable. You will end up being much more confident pushing some changes, and peace of mind is priceless. Pytest is great, no excuse.
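For instance, something as small as the following already catches regressions; the `normalize_text` function is a toy stand-in for your own preprocessing code.

```python
# test_preprocessing.py - a tiny, self-contained pytest example.
# In a real project, normalize_text would live in your package and be imported here.
import pytest


def normalize_text(text: str) -> str:
    """Toy preprocessing function standing in for your own code."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    return " ".join(text.lower().split())


def test_normalize_text_lowercases_and_strips():
    assert normalize_text("  Hello World  ") == "hello world"


def test_normalize_text_rejects_non_string_input():
    with pytest.raises(TypeError):
        normalize_text(42)
```

Run it with `pytest` from the project root, ideally in your CI.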
Data
Test your data pipelines. Many things can go wrong, and you will go nowhere without good data. Don’t assume anything, especially that:
- Data is the same and follows the same distribution across training, testing and production,
- Things won't change over time ("it was fine 2 months ago for the first release, so it should still be fine")
Make sure that your pipelines work: check data types, formats and distributions, look for data drift and leakage, etc. As you have to run these checks regularly, automate the most critical ones as soon as you can.
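Some of these checks can start as plain assertions on a pandas DataFrame, run automatically in your pipeline; the column names and bounds below are made up.

```python
# Sketch of basic sanity checks on a pandas DataFrame before training or inference.
# Column names, expected values and bounds are made-up examples.
import pandas as pd


def check_training_data(df: pd.DataFrame) -> None:
    expected_columns = {"user_id", "amount", "label"}
    missing = expected_columns - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    assert df["user_id"].notna().all(), "user_id should never be null"
    assert df["label"].isin([0, 1]).all(), "labels should be binary"
    assert df["amount"].between(0, 10_000).all(), "amounts outside the expected range"

    # crude drift check: compare the positive rate to what you saw at training time
    positive_rate = df["label"].mean()
    assert 0.01 <= positive_rate <= 0.30, f"suspicious positive rate: {positive_rate:.3f}"
```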
Model
Make tests a standard and mandatory part of your pipeline. I've seen too many people push overfitted models to production, or fail to compare different models properly because they didn't have a standard benchmark (which should include a standard dataset with a split between training and validation). Be able to rebuild your dataset, and keep track of your data.
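A sketch of what such a mandatory model check can look like, with a frozen benchmark split versioned alongside the code; the model class, paths and threshold are all placeholders.

```python
# Sketch: every candidate model is evaluated on the same frozen benchmark split and
# must clear a minimum bar before it can be shipped. Everything below is a stand-in.
import json
import statistics


def load_benchmark(path: str) -> tuple[list[list[float]], list[int]]:
    """Load a frozen validation split that is versioned alongside the code."""
    with open(path) as f:
        payload = json.load(f)
    return payload["features"], payload["labels"]


class ThresholdModel:
    """Toy stand-in for the candidate model you actually want to ship."""

    def predict(self, features: list[list[float]]) -> list[int]:
        return [int(statistics.mean(x) > 0.5) for x in features]


def test_candidate_model_clears_the_benchmark():
    features, labels = load_benchmark("benchmarks/validation_v1.json")  # hypothetical path
    model = ThresholdModel()
    predictions = model.predict(features)
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    assert accuracy >= 0.85  # bar taken from your current baseline or production model
```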
Monitor and log performance
Use logging to monitor the performance of your system. That means not only the accuracy of your model, but also things like:
- Errors (`try ... except` is powerful, make your program resilient; see the sketch after this list)
- Inference time
- Service level, hardware consumption, network latency…
- All other extra steps (preprocessing, formatting output, database operations, …)
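As an illustration, a thin wrapper around the prediction call already captures errors and inference time; the model object, logger configuration and field names are placeholders.

```python
# Sketch: log errors and inference time around the prediction call.
# The model object and the logging configuration are placeholders.
import logging
import time

logger = logging.getLogger("inference")


def predict_with_logging(model, features):
    start = time.perf_counter()
    try:
        return model.predict(features)
    except Exception:
        logger.exception("prediction failed")  # stack trace goes to the logs, not to the void
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("inference_time_ms=%.1f", elapsed_ms)
```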
Your deployed services should be stateless most of the time, so try not to keep all these logs inside the container, which might crash or be deleted at some point (especially if you treat servers as cattle). Rather, rely on tools like Kibana (see also Loki) for this, which should already be part of your infrastructure anyway.
Final words
There is no silver bullet, and we are in a rapidly changing landscape, especially regarding MLOps technologies and practices. Some of this advice might not apply to you or to every situation, but it should give you some leads to go from a messy experiment to something running and doing what it was designed for.
Feel free to contact me for suggestions or corrections.