When to Dockerize vs. When to use ModelKit

Jesse Williams - May 30 - Dev Community

ML development is often a cumbersome, iterative process that relies on many open source tools, each built to handle a specific part of the machine learning workflow. As a result, working on machine learning projects is becoming increasingly complex and challenging.

For software projects, Docker has long alleviated these issues by providing an isolated environment and bundling code and dependencies into a single unit. However, the Docker approach fails to address many of the intricate aspects of an ML project, especially versioning and packaging data and model artifacts.

Recently, an alternative called ModelKits has emerged, in which each component (data, artifacts, code) is treated as a separate entity. A ModelKit bundles all the assets involved in an ML project into one shareable artifact, making it easier to track, share, and collaborate.

This blog explores the pros and cons of using Docker for ML projects and when you should consider using ModelKits.

Docker

Docker is an open source platform for developing, shipping, and running applications. Since its release in 2013, Docker has gained popularity among developers (including ML engineers) as a go-to tool for packaging software and its dependencies. The reasons behind its popularity include:

  • Consistency: Docker containers (or simply containers) run consistently regardless of changes in the host system.
  • Portability: Containers can run on any platform that supports Docker.
  • Isolation: Containers isolate software and its dependencies from external applications running on the host.
  • Microservice architecture: Docker makes it possible to build and deploy microservices quickly.

Docker and machine learning

The reproducibility, isolation, scalability, and portability offered by Docker make it an attractive tool for ML engineers. However, using Docker may not always be the best choice for ML projects due to the following:

  • Lack of version control for model and data
  • Bloated container images
  • Difficulty managing dependencies
  • Complexity of ML projects

Let’s go through each of them.

Lack of version control for model and data
An ML project involves numerous assets and artifacts - code, data, model parameters, model hyperparameters, configuration, etc. Each asset is stored separately: code in a git repository, serialized model in a container, dataset in a file storage service (S3 or similar), and model artifacts in an MLOps tool (like MLflow).

While Docker makes it easy to package project contents, it doesn’t allow developers to track the changes made to the package's contents. In a dynamic project where code, data, and model artifacts are frequently updated, linking a specific version of one component (like code) with others (data and model) can only be done by embedding them all in a single container (see the next section for why this is a problem).

Bloated container images

If you try to manage version dependencies by packaging all the artifacts (model, code, and datasets) in a container, you end up with containers that can easily grow to tens or hundreds of gigabytes. That's the opposite of what containers are meant to be, and it makes them hard to work with, especially since you have to pull the whole container even if you only need the model or the dataset.

Even in the simplest case, machine learning projects often depend on large libraries such as TensorFlow and PyTorch. These libraries have their own dependencies, such as NumPy, Keras, etc. Including all of these dependencies produces large Docker images, which hurt portability and increase deployment time. For example, a Docker image containing only TensorFlow with GPU support occupies over 3.5 GB of disk space before any model or data is added. An image this massive is time-consuming to download and distribute.
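As a rough illustration of how this adds up, consider a minimal Dockerfile for a serving image. The base image shown is one of TensorFlow's published GPU tags; the file names are hypothetical:

```dockerfile
# The GPU-enabled TensorFlow base image alone is several gigabytes.
FROM tensorflow/tensorflow:2.15.0-gpu

WORKDIR /app

# Each requirement drags in its own transitive dependencies.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Baking the model and dataset into the image inflates it further:
# anyone who needs just the model must still pull everything.
COPY models/model.h5 ./models/
COPY data/train.csv ./data/
COPY serve.py .

CMD ["python", "serve.py"]
```

Every consumer of this image pays the full download cost, whether they need the serving code, the weights, or neither.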

Difficulty managing dependencies

It is common for ML projects to have a large number of dependencies (frameworks, libraries), each of which can have its own dependencies, making them complex to manage within a Dockerfile. Additionally, the underlying OS may not be compatible with an installed dependency version, resulting in errors and failures during deployment.

Complexity of ML projects

Machine learning is an iterative process and involves numerous steps. Each step introduces additional configuration and setup. For instance,

  • The training step requires tracking model parameters, error functions, learning rates, etc., to find the best ones.
  • Testing and validation require their own (separate) datasets, as well as keeping track of the model used and the metrics (accuracy, error, etc.).
  • Monitoring requires tracking properties of data and labels, such as their mean, number of missing values, erroneous inputs, etc.

Additionally, each step may necessitate creating a separate container, making the process complex. Docker’s success came with microservices - small, simple services. It wasn’t designed to support complex ML workflows and the large models and datasets they require.

What about end-to-end MLOps tools?

Numerous MLOps tools (MLflow, Neptune, SageMaker, etc.) have emerged to solve the above issues, but they have a serious flaw: they require all artifacts and data to be stored in the tool, and all changes to happen through the tool. They have no way to track changes made outside the tool, which is functionally equivalent to not having version control.

Furthermore, most of these tools aim to tie customers to a specific vendor by introducing proprietary standards and formats. This can introduce problems in the future when customers want to switch or use a tool not supported by the particular MLOps platform or vendor.

So, are there any alternatives?

The above limitations have given rise to KitOps, an open source project designed to enhance collaboration among stakeholders in a machine learning team.

KitOps revolves around ModelKits, an OCI-compliant packaging format that enables seamless sharing of all the artifacts involved in an ML project. A ModelKit is defined using a Kitfile, a YAML configuration file that is more intuitive than a Dockerfile. It defines the models, datasets, code, and artifacts in the project, along with some metadata. This article won't go into the details of the Kitfile format, but you can find relevant information in the KitOps documentation. A sample Kitfile is provided below:

```yaml
manifestVersion: v1.0.0
package:
  authors:
  - Jozu
  name: FlightSatML

code:
- description: Jupyter notebook with model training code in Python
  path: ./notebooks

model:
  description: Flight satisfaction and trait analysis model using Scikit-learn
  name: joblib Model
  path: ./models/scikit_class_model_v2.joblib
  version: 1.0.0

datasets:
- description: Flight traits and traveller satisfaction training data (tabular)
  name: training data
  path: ./data/train.csv
```

Users can use the Kit CLI to package their ML project into a ModelKit and then interact with the components of the ModelKit. For instance,

  • kit pack packages the artifacts
  • kit unpack lets you retrieve only part of the package - just the model, the datasets, or just the notebook - from a remote registry
  • kit pull pulls everything from the remote registry
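Taken together, a typical round trip looks something like the sketch below. The registry, repository, and tag are hypothetical, and exact flag spellings should be checked against `kit --help`:

```shell
# Package the project described by the Kitfile and tag it
# (the registry/repo:tag reference is hypothetical).
kit pack . -t registry.example.com/team/flight-sat:v1.0.0
kit push registry.example.com/team/flight-sat:v1.0.0

# Later, pull everything from the remote registry...
kit pull registry.example.com/team/flight-sat:v1.0.0

# ...or unpack only the pieces you need.
kit unpack registry.example.com/team/flight-sat:v1.0.0 --model -d ./models
kit unpack registry.example.com/team/flight-sat:v1.0.0 --datasets -d ./data
```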

Advantages of using ModelKit include:

  • Version-controlled model packaging
    ModelKit combines code and model artifacts into a single package and allows tagging, easing the process of sharing and tracking these components. Furthermore, stakeholders can unpack individual components or the entire package using a single command.

    For example, a data scientist can unpack only the model and dataset, while an MLOps engineer can unpack relevant code and related artifacts for testing and deployment.

    This integration makes it easier to manage files, resulting in easy collaboration and speedy development and deployment.

  • Improved security
    Each ModelKit includes an SHA digest for the associated assets and can be signed. This makes it easy to detect changes made to any of the assets and, hence, identify any tampering.

  • Future-Proofed
    Unlike vendor-specific tools that try to lock in customers, ModelKits offer a standards-based, open source solution for packaging and versioning. They can be stored in any OCI-compliant registry (like Docker Hub, Artifactory, GitLab, or others), use YAML for configuration, and work alongside other MLOps and DevOps tools (like Hugging Face, ZenML, Git, etc.). This makes ModelKits a widely compatible and future-proof solution for packaging and versioning ML projects.

  • Lightweight deployment and efficient dependency management
    While Docker requires including a heavy base image, ModelKits allow developers to include only the essential assets: code, datasets, serialized models, etc. This results in lightweight packages that are quick and easy to deploy.

    Furthermore, the dependencies can be shipped along with the code, making it easier to specify and manage dependencies.
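The "Improved security" point above rests on content digests, the same mechanism OCI registries use to address blobs: any change to an asset changes its digest. A minimal sketch in Python of how a recorded digest exposes tampering (the file name is hypothetical; this is illustrative, not KitOps code):

```python
import hashlib

def sha256_digest(path: str) -> str:
    """Compute the SHA-256 digest of a file, streaming in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

# Record the digest of a sample "model" artifact at packaging time.
with open("model.bin", "wb") as f:
    f.write(b"weights-v1")
expected = sha256_digest("model.bin")

# Any later modification changes the digest, revealing the tampering.
with open("model.bin", "ab") as f:
    f.write(b"extra-byte")
assert sha256_digest("model.bin") != expected
```

Signing the manifest that lists these digests then extends the guarantee from "unchanged" to "unchanged and from a known publisher".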

Thus, unlike general-purpose tools, ModelKits treat machine learning assets as first-class citizens, addressing specific needs such as packaging, versioning, environment configuration, and efficient dependency management. This focus ensures that the unique challenges of machine learning projects are addressed more effectively than with general-purpose tools like Docker.

When to use ModelKit

Certain use cases make ModelKits desirable. Let's explore a few of them.

Lightweight application development
In software engineering, lightweight applications and packages are preferred as they are easy to share and deploy. Adopting ModelKits allows teams to build lighter packages and minimize the use of external tools. For instance, instead of using separate tools for tracking data, models, and code, teams can now rely on ModelKit - a single tool. The same is true for packaging. This results in smaller applications while saving time.

Integration with existing DevOps pipelines
Introducing a new tool into your existing DevOps pipeline often reduces productivity due to a steep learning curve and integration challenges. However, ModelKits rely on open standards like YAML to specify models, datasets, code, etc., which are already familiar to developers. ModelKits store their assets in an OCI-compatible registry, which makes them compatible with the tools, registries, and processes most organizations already use.

It’s important to note, however, that ModelKits aren’t meant to replace containers for production workloads. Instead, most organizations will treat the ModelKit as the storage location for the serialized model that can then be packaged into a container or side-car as preferred through pipelines that already work with the OCI standard.

Version-controlled model packaging
If your team is tired of using a separate tool to package and track each individual component (code, dataset, model) in an ML project, they will greatly benefit from using ModelKits. With a ModelKit, you can package the code, the dataset, and the model generated from them together, and tag them with a version.
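In practice, that versioning is just a tag on the packed bundle. A hedged sketch with a hypothetical repository and tags:

```shell
# v1.0.0 pins the exact code, dataset, and model that shipped together.
kit pack . -t registry.example.com/team/flight-sat:v1.0.0

# After retraining on new data, cut a new version; the old tag still
# resolves to the old combination, so earlier results stay reproducible.
kit pack . -t registry.example.com/team/flight-sat:v1.1.0
```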

Docker makes it extremely convenient to deploy and share applications. Its isolation, portability, and integration with major cloud providers make it even more enticing for machine learning. However, using Docker alone requires introducing numerous extra tools to effectively version and package machine learning projects.

So, if you want a simple tool that efficiently versions and packages ML projects while supporting containers, ModelKit is the way to go. Adopt ModelKit in your workflow by following the quick start guide and experimenting with KitOps.
