One of the most difficult challenges for a machine learning engineering team is efficiently bringing the considerable work done during a model's research and development stages to production. Often, the entire development stage of a model happens inside a Jupyter Notebook, a tool focused on experimentation rather than on building a deployable artifact.
In this post, you’ll see how to address this challenge. For our example, we will take a Llama 3 LLM with LoRA adapters that was fine-tuned in a Jupyter Notebook and turn it into a deployable artifact using KitOps and ModelKit. This artifact will then be deployed as a pod in a Kubernetes cluster.
Why is a Jupyter Notebook not directly deployable to production?
A Jupyter Notebook is a web-based interactive tool that allows you to create a computational environment to produce documents containing code and rich text elements. This is the standard tool for research and development of a new machine learning model or a new fine-tuning methodology because Jupyter Notebook is focused on:
- The immediate observation of results, enabled by the interactive execution of code in individual cells. This facilitates iterative problem-solving and experimentation.
- Containing the entire computational workflow, including the code, explanations, and outputs. This characteristic promotes the replicability and dissemination of research ideas.
- Collaboration, making it easy to share work with colleagues and exchange knowledge.
While the Jupyter Notebook is the right tool for data exploration and prototyping, it is generally not ideal for direct deployment in production for a few reasons:
- Cells can be executed in any order, which leads to a non-linear flow of logic and hidden state (variables that persist and change between runs). This makes it difficult to understand how the code works and to troubleshoot issues in production.
- Notebooks are often single-threaded and not designed to handle high volumes of traffic. They don't perform well in applications requiring real-time responsiveness, nor do they scale efficiently across multiple nodes or clusters.
- Notebooks can execute arbitrary code and often contain sensitive information, raw code, and data, which can be a security vulnerability if not properly managed.
- Production environments typically require logging, monitoring, and integration with other systems, and notebooks are not built for this type of integration.
There are many ways to deploy the model developed within your Jupyter Notebook, depending on your needs and team skills. In this article, we will walk you through deploying the model to Kubernetes with an init container, using KitOps and ModelKit.
What are the stages to bring your model to production?
Using Jupyter Notebook, you can develop and/or fine-tune an AI/ML model, but a model needs to pass through specific stages before being deployed to production. For each stage, the whole team (data scientists, ML engineers, SREs) needs artifacts that are immutable, secure, and shareable.
KitOps and ModelKits, an OCI-compliant packaging format, are designed to respond to this need. ModelKit standardizes how all necessary artifacts, including datasets, code, configurations, and models, are packaged, making the development process simpler and more collaborative.
The stages of the development process for an ML model are:
- Untuned: This is the stage where the dataset is designed for model tuning and model validation. At this stage, the ModelKit contains only the datasets and the base model; for example, Llama 3 8B Instruct quantized to 4 bits (q4_0).
- Tuned: At this stage, the model has completed the training phase. The ModelKit would include the model, plus training and validation datasets, the Jupyter Notebook with the fine-tuning code, and other assets for the specific use case, such as LoRA adapters for LLM fine-tuning.
- Challenger: The model should be prepared to replace the current champion model. ModelKit would include any codebases and datasets that should be deployed to production with the model.
- Champion: The model is deployed in production. This is the same ModelKit as the challenger-tagged one but with an updated tag to show what's in production.
Next, you’ll see the step-by-step process of implementing these stages and creating immutable artifacts with KitOps.
Stage 1: Create the untuned ModelKit
To begin, create a project folder named llama3-finetune. Within this folder, create two subfolders: one for datasets and another for notebooks. If you haven’t installed the Kit CLI, follow this guide to install it.
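As a concrete starting point, the folder layout can be created from your shell; the names below match the project structure used later in this post:
mkdir -p llama3-finetune/dataset llama3-finetune/notebooks
cd llama3-finetune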
Open your shell and unpack the ModelKit containing the Llama 3 8-billion-parameter model that you will fine-tune:
kit login ghcr.io
kit unpack ghcr.io/jozu-ai/llama3:8B-instruct-q4_0
Unpacking also creates another file in your folder: the Kitfile. This is the manifest for your ModelKit; it describes the set of files and directories to package, such as adapters, a Jupyter Notebook, and datasets.
Next, you are ready to embark on the first step of your MLOps pipeline: creating the dataset for training. If your goal is to fine-tune an open-source large language model like Llama 3 for a chat use case, a widely used dataset is Hugging Face's ultrachat_200k.
Install the datasets package and download the dataset in a new Jupyter Notebook called finetune.ipynb:
#install necessary packages:
!pip install datasets
#Download the dataset:
from datasets import load_dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k")
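Before transforming anything, it can help to confirm what the dataset contains. The snippet below is an optional inspection step; it assumes the standard ultrachat_200k layout, where each record stores a list of chat turns under the messages key:
#Optional: inspect the splits and peek at one record
print(ds)                                  # shows the available splits (e.g., train_gen, test_gen)
print(ds["train_gen"][0]["messages"][:2])  # first two chat turns of the first training example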
Starting from this dataset, develop the code to adapt it for your particular use case.
In this use case, the dataset will be converted with the chat template; first, you need to install the transformers package:
#install necessary packages:
!pip install transformers
In the next cell of your Jupyter Notebook, use the tokenizer's apply_chat_template method to create the train and test datasets for the chat use case:
import os
from transformers import AutoTokenizer

model_id = 'AgentPublic/llama3-instruct-8b'

train_dataset = ds['train_gen']
test_dataset = ds['test_gen']

tokenizer = AutoTokenizer.from_pretrained(model_id)

def format_chat_template(row):
    # Render the list of chat messages into a single prompt string
    chat = tokenizer.apply_chat_template(row["messages"], tokenize=False)
    return chat

os.makedirs('../dataset', exist_ok=True)

with open('../dataset/train.txt', 'a', encoding='utf-8') as f_train:
    for d in train_dataset:
        f_train.write(format_chat_template(d))

with open('../dataset/test.txt', 'a', encoding='utf-8') as f_test:
    for d in test_dataset:
        f_test.write(format_chat_template(d))
The train and test datasets are saved in the project's dataset folder.
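As a quick sanity check, you can print part of one formatted sample to confirm the chat template was applied as expected (an optional step that reuses the format_chat_template function defined above):
#Optional: preview the first 500 characters of one formatted training sample
print(format_chat_template(train_dataset[0])[:500])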
The structure of your project is now something like this:
llama3-finetune
|-- dataset
|   |-- train.txt
|   |-- test.txt
|-- notebooks
|   |-- finetune.ipynb
|-- Kitfile
|-- llama3-8B-instruct-q4_0.gguf
At this point, you need to update the Kitfile so it packages everything you need for the following stages:
manifestVersion: "1.0"
package:
  name: llama3 fine-tuned
  version: 3.0.0
  authors: [Jozu AI]
code:
  - description: Jupyter notebook with dataset creation
    path: ./notebooks
model:
  name: llama3-8B-instruct-q4_0
  path: ghcr.io/jozu-ai/llama3:8B-instruct-q4_0
  description: Llama 3 8B instruct model
  license: Apache 2.0
datasets:
  - description: training set from ultrachat_200k
    name: training_set
    path: ./dataset/train.txt
  - description: test set from ultrachat_200k
    name: test_set
    path: ./dataset/test.txt
The Kitfile can now be pushed to your Git repository, and you can tag the commit as untuned, the same tag as the ModelKit. In your terminal, run:
kit pack ./llama3-finetune -t registry.gitlab.com/internal-llm/llama3-ultrachat-200k:untuned
kit push registry.gitlab.com/internal-llm/llama3-ultrachat-200k:untuned
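If you want to confirm what was packed before moving on, the Kit CLI can list local ModelKits and show a ModelKit's contents; the exact flags may vary between Kit CLI versions:
kit list
kit inspect registry.gitlab.com/internal-llm/llama3-ultrachat-200k:untuned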
Stage 2: Fine-tune the model
The second stage is to fine-tune the model in the Jupyter Notebook, developing the code to perform this. For an LLM, one possible solution is to fine-tune LoRA adapters with llama.cpp, as shown below:
# finetune LORA adapter
./bin/finetune \
  --model-base open-llama-3b-v2-q8_0.gguf \
  --checkpoint-in chk-lora-open-llama-3b-v2-q8_0-ultrachat-200k-LATEST.gguf \
  --checkpoint-out chk-lora-open-llama-3b-v2-q8_0-ultrachat-200k-ITERATION.gguf \
  --lora-out lora-open-llama-3b-v2-q8_0-ultrachat-200k-ITERATION.bin \
  --train-data "./dataset/train.txt" \
  --save-every 10 \
  --threads 6 --adam-iter 30 --batch 4 --ctx 64 \
  --use-checkpointing
In the llama.cpp repository, you can find all the information you need to build it for your architecture.
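Before packaging the adapter, you may also want a quick smoke test. One option is to run llama.cpp's example binary with the base model plus the new adapter; paths and flag names depend on your llama.cpp build, so treat this as a sketch:
# quick smoke test of the fine-tuned adapter (flags may differ across llama.cpp versions)
./bin/main \
  -m open-llama-3b-v2-q8_0.gguf \
  --lora lora-open-llama-3b-v2-q8_0-ultrachat-200k-LATEST.bin \
  -p "Hello, how can I help you today?" \
  -n 128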
When the fine-tuning is finished, you can add a parts section for the LoRA adapter to your Kitfile. This way, the fine-tuned adapter is packaged in your artifact.
manifestVersion: "1.0"
package:
  name: llama3 fine-tuned
  version: 3.0.0
  authors: [Jozu AI]
code:
  - description: Jupyter notebook with dataset creation
    path: ./notebooks
model:
  name: llama3-8B-instruct-q4_0
  path: ghcr.io/jozu-ai/llama3:8B-instruct-q4_0
  description: Llama 3 8B instruct model
  license: Apache 2.0
  parts:
    - path: ./lora-open-llama-3b-v2-q8_0-ultrachat-200k-LATEST.bin
      type: lora-adapter
datasets:
  - description: training set from ultrachat_200k
    name: training_set
    path: ./dataset/train.txt
  - description: test set from ultrachat_200k
    name: test_set
    path: ./dataset/test.txt
This new version of the Kitfile can be pushed to your Git repository, and the ModelKit for the tuned stage can be packed and pushed. In your terminal, run:
kit pack ./llama3-finetune -t registry.gitlab.com/internal-llm/llama3-ultrachat-200k:tuned
kit push registry.gitlab.com/internal-llm/llama3-ultrachat-200k:tuned
You can run the fine-tune stage many times, changing the training parameters, the fine-tuning strategy, or both. For every run, you can tag a new ModelKit (for example, with a version number); this way, you keep a record of all your fine-tuning attempts.
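For example, each run can get its own version tag so that no attempt overwrites another (the tag names here are illustrative):
kit pack ./llama3-finetune -t registry.gitlab.com/internal-llm/llama3-ultrachat-200k:tuned-v1
kit push registry.gitlab.com/internal-llm/llama3-ultrachat-200k:tuned-v1
kit pack ./llama3-finetune -t registry.gitlab.com/internal-llm/llama3-ultrachat-200k:tuned-v2
kit push registry.gitlab.com/internal-llm/llama3-ultrachat-200k:tuned-v2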
Stage 3: Create the challenger
When the metrics that matter most for your use case, such as accuracy, precision, recall, F1 score, or mean absolute error, exceed the thresholds you set, your fine-tuned model is ready to challenge the champion model; the ModelKit can be tagged as the challenger. In your terminal, do this with:
kit pack ./llama3-finetune -t registry.gitlab.com/internal-llm/llama3-ultrachat-200k:challenger
kit push registry.gitlab.com/internal-llm/llama3-ultrachat-200k:challenger
Note that at this stage, the ModelKit contains the model, the datasets, and the code necessary for the fine-tuning. This way, anyone can fully reproduce the fine-tuning stage and, with it, the challenger model.
Now, the challenger model can be deployed to production for A/B testing and is ready to challenge the champion model. If the challenger model performs better, it becomes the champion. You are now ready to deploy the new version of your model to production.
Stage 4: Deploy the new champion
At this stage, you can retag the same challenger ModelKit, which contains the model (plus LoRA adapters), the code, and the datasets, with a new tag: champion.
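A minimal way to do the retag with the Kit CLI, assuming the challenger ModelKit is available locally, is:
kit tag registry.gitlab.com/internal-llm/llama3-ultrachat-200k:challenger registry.gitlab.com/internal-llm/llama3-ultrachat-200k:champion
kit push registry.gitlab.com/internal-llm/llama3-ultrachat-200k:champion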
For the deployment to the production environment, for example, to Kubernetes through an init container, you can use an image with the Kit CLI and unpack only the model into the pod's shared volume:
kit unpack registry.gitlab.com/internal-llm/llama3-ultrachat-200k:champion --model -d /path/to/shared/volume
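A minimal sketch of what this can look like in a Kubernetes Deployment is below. The image names, volume, and paths are placeholders: the init container needs an image that ships the Kit CLI and credentials for your registry, and the main container is whatever serving runtime you use (for example, a llama.cpp server):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-champion
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama3-champion
  template:
    metadata:
      labels:
        app: llama3-champion
    spec:
      volumes:
        - name: model-store
          emptyDir: {}
      initContainers:
        - name: fetch-model
          image: your-registry/kit-cli:latest     # placeholder image containing the Kit CLI
          command: ["kit", "unpack",
                    "registry.gitlab.com/internal-llm/llama3-ultrachat-200k:champion",
                    "--model", "-d", "/models"]
          volumeMounts:
            - name: model-store
              mountPath: /models
      containers:
        - name: llm-server
          image: your-registry/llm-server:latest  # placeholder serving image
          volumeMounts:
            - name: model-store
              mountPath: /models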
Now, the new champion is deployed to production, and every step of the MLOps pipeline, from the Jupyter Notebook to the deployment of the model in production, is completed.
Thanks to KitOps and ModelKit, the model's development, fine-tuning, and deployment process is fully reproducible. Each artifact created is immutable, secure, shareable among team members, and stored in your preferred container registry or a purpose-built registry like Jozu Hub.