Don’t repeat yourself (DRY) is a fundamental principle in software engineering. As such, software engineers often reach for readily available code or pre-built solutions to commonly occurring problems. Consequently, it is common for large machine learning projects to rely on numerous other open source projects. For instance, transformers, a library commonly used to build transformer-based models, depends on over a thousand other open source projects.
Using open source projects to build your own has both upsides and downsides. Choosing suitable projects accelerates development because you don't have to build everything from scratch. However, if a project you depend on lacks active development or support, your own project can end up delayed.
To help you choose the right projects, this blog explores criteria you can use to gauge the reliability of open source projects, along with ten lesser-known open source projects that can immensely assist you in building robust machine learning products.
Gauging the reliability of open source projects
If you navigate to GitHub, you will find numerous open source projects that seem beneficial at first glance. However, not all of them are mature and ready to be used in commercial products. To determine whether or not you should use a specific open source library, you can evaluate it based on the criteria below:
- Documentation and support: Projects with detailed documentation and quick support (via GitHub, Slack, or Discord) are better than those without.
- Activity: Open source projects actively resolving pull requests and issues can be considered more reliable than archived or stale projects.
- Sponsors: Projects sponsored by companies tend to be more reliable because contributors and maintainers are paid to actively build and improve the project.
- License: Some projects have licenses that prevent users from distributing the project or using it for commercial purposes. Hence, it is wise to read the “LICENSE” file carefully before using a project commercially.
Now that you know the criteria involved in assessing an open source project, let’s look at ten helpful open source MLOps projects that you may not have heard of.
Open source machine learning projects
KitOps
In machine learning projects, data engineers, data scientists, and machine learning engineers often struggle to collaborate. Handing off artifacts (features, models, code, datasets, etc.) involves a convoluted process that can necessitate frequent meetings or pair-programming sessions. Many teams have tried to automate this process, but doing so requires numerous tools that can introduce additional complexity.
KitOps eliminates these complexities by treating all components in an ML project as a single software unit. This makes it easier to package, version, and track those components, resulting in smooth handoffs between different stakeholders.
Furthermore, with KitOps, it is possible to unpack individual components (data, model, or code) and work on them. This simplifies collaboration and dependency management. KitOps supports a majority of the tools in the ML ecosystem and has beginner-friendly documentation, making it easier to switch.
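To make the workflow concrete, here is a rough sketch of a Kitfile (the YAML manifest KitOps uses to describe a ModelKit) and the `kit` CLI commands for packaging and sharing it. The field names, registry path, and unpack flags below are illustrative assumptions; consult the KitOps documentation for the exact schema.

```yaml
# Kitfile (illustrative): declares the components that make up one ModelKit
manifestVersion: "1.0"
package:
  name: churn-model
  version: 1.0.0
  description: Customer churn classifier
model:
  name: churn-classifier
  path: ./model.joblib        # serialized model artifact
datasets:
  - name: training-data
    path: ./data/train.csv
code:
  - path: ./src               # training and inference code
```

```shell
kit pack . -t registry.example.com/demo/churn-model:v1        # package the project into a ModelKit
kit push registry.example.com/demo/churn-model:v1             # push it to an OCI-compatible registry
kit unpack registry.example.com/demo/churn-model:v1 --model   # pull back just the model component
```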
The team behind KitOps recently announced a second project called Jozu Hub, to host ModelKits. You can check out the registry here and learn more about the Hub here.
EvalML
Hyperparameter tuning and model evaluation are integral aspects of ML product development. EvalML is an AutoML library that aims to ease the process of building, optimizing, and evaluating ML models, helping engineers avoid manually training and tuning each model. It also includes data quality checks and cross-validation.
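Here is a minimal sketch of what an EvalML search looks like, using one of the demo datasets that ships with the library (the objective and dataset chosen here are just examples):

```python
import evalml
from evalml.automl import AutoMLSearch

# Load a bundled demo dataset and split it for a binary classification problem.
X, y = evalml.demos.load_breast_cancer()
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X, y, problem_type="binary"
)

# Search over candidate pipelines (preprocessing + estimator + hyperparameters).
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
automl.search()

# Inspect the leaderboard and score the best pipeline on held-out data.
print(automl.rankings.head())
best_pipeline = automl.best_pipeline
print(best_pipeline.score(X_test, y_test, objectives=["log loss binary"]))
```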
Burr
Burr is a new library that focuses on applications of Large Language Models (LLMs). Burr enables developers to build decision-making agents (chatbots and simulations) using simple Python building blocks and deploy them on custom hardware. It includes three components (a minimal sketch follows this list):
- A Python library that enables you to build and manage state machines with simple Python functions.
- A UI you can use for introspection and debugging.
- A set of integrations to help you integrate Burr with other systems.
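As a rough illustration of those building blocks, here is a toy state machine written against Burr's Python API; the action names, state fields, and halting condition are made up for the example, and the exact API may differ between versions:

```python
from burr.core import ApplicationBuilder, State, action, default, expr

# Actions declare which state fields they read and write.
@action(reads=["count"], writes=["count"])
def increment(state: State) -> State:
    return state.update(count=state["count"] + 1)

@action(reads=["count"], writes=["message"])
def finish(state: State) -> State:
    return state.update(message=f"Stopped at {state['count']}")

# Wire the actions into a state machine with explicit transitions.
app = (
    ApplicationBuilder()
    .with_actions(increment=increment, finish=finish)
    .with_transitions(
        ("increment", "increment", expr("count < 5")),  # loop while the condition holds
        ("increment", "finish", default),               # otherwise move to the final action
    )
    .with_state(count=0)
    .with_entrypoint("increment")
    .build()
)

last_action, result, state = app.run(halt_after=["finish"])
print(state["message"])
```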
Evidently
Evidently is an open source monitoring tool built by Evidently AI to help data scientists identify data drift, label drift, and changes in model performance, as well as run custom tests. Evidently works with tabular and textual data, including embeddings. In short, it is a tool to evaluate, test, and monitor machine learning models in production.
Evidently allows users to run pre-defined tests or create custom tests for their machine learning model. The computed test results and metrics can be displayed in an interactive dashboard, which can be hosted on public or private cloud infrastructure.
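For example, running a pre-built data drift preset and exporting the interactive dashboard as HTML looks roughly like this (the file names are placeholders, and the exact imports depend on the Evidently version you install):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference data (e.g., the training set) vs. current data (e.g., recent production traffic).
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current.csv")

# Run a pre-built preset of drift metrics and export an interactive HTML report.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")
```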
Rerun
The majority of the tools in the MLOps space focus on models and infrastructure. Unlike those tools, Rerun focuses on data, which is integral to any machine learning project. Rerun is a time-series database and visualizer for temporal and multimodal data. It’s used in robotics, spatial computing, 2D/3D simulation, and finance to verify, debug, and explain systems. Rerun provides SDKs (C++, Python, and Rust) to log data like images, tensors, point clouds, and text.
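Logging from the Python SDK takes only a few lines; the sketch below streams a synthetic point cloud and a scalar metric into the viewer (the entity paths and values are made up, and archetype names can vary slightly between SDK versions):

```python
import numpy as np
import rerun as rr

# Start a recording and spawn the Rerun viewer (requires the rerun-sdk package).
rr.init("rerun_example_points", spawn=True)

for frame in range(100):
    # Associate everything logged below with this point on the "frame" timeline.
    rr.set_time_sequence("frame", frame)

    # Log a synthetic 3D point cloud and a scalar metric for this frame.
    points = np.random.randn(500, 3)
    rr.log("world/points", rr.Points3D(points))
    rr.log("metrics/loss", rr.Scalar(1.0 / (frame + 1)))
```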
Agenta
Unlike other libraries on this list, Agenta is an end-to-end LLM developer platform. It allows engineers to collaborate on prompts and to evaluate and monitor LLM-powered applications. It also helps with incorporating human feedback and deploying LLM-based apps. Agenta is model- and library-agnostic, allowing engineers to use any library and model, which in turn makes the tool more versatile and useful.
marimo
Jupyter Notebook is an effective tool for experimenting with machine learning models, but it has a few problems: Jupyter notebooks are hard to deploy and collaborate on. marimo is an open source reactive notebook for Python. Unlike Jupyter notebooks, marimo notebooks are reproducible, interactive, git-friendly, easy to collaborate on, and can be deployed as scripts or apps. As a result, marimo notebooks are a great candidate for building internal tools like this model comparison tool.
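Because a marimo notebook is stored as a plain Python file, it can be versioned like any other code and served directly as an app. The sketch below shows roughly what a notebook file looks like, with a slider whose dependent cells re-run automatically (the widget, file name, and values are arbitrary examples):

```python
# churn_explorer.py - edit with `marimo edit churn_explorer.py`,
# or serve it as a read-only app with `marimo run churn_explorer.py`.
import marimo

app = marimo.App()

@app.cell
def __():
    import marimo as mo
    return (mo,)

@app.cell
def __(mo):
    # A UI element; any cell that references `threshold` re-runs when it changes.
    threshold = mo.ui.slider(0.0, 1.0, step=0.05, value=0.5, label="Decision threshold")
    threshold
    return (threshold,)

@app.cell
def __(threshold):
    print(f"Current threshold: {threshold.value}")
    return

if __name__ == "__main__":
    app.run()
```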
Kedro
A serious problem with machine learning projects is the complex process involved in taking models from development to production. Kedro is an open source tool that solves this problem by employing software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular. Additionally, Kedro provides Kedro-Viz for visualizing machine learning workflows. It also allows you to create project templates to organize configuration, source code, tests, documentation, and notebooks within a project.
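A Kedro pipeline is just plain Python functions wired together by the names of their inputs and outputs; a minimal sketch might look like the following (the dataset names would normally be declared in the project's Data Catalog and are placeholders here):

```python
import pandas as pd
from kedro.pipeline import Pipeline, node


def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values."""
    return raw.dropna()


def add_features(clean: pd.DataFrame) -> pd.DataFrame:
    """Toy feature-engineering step."""
    clean["total"] = clean.sum(axis=1, numeric_only=True)
    return clean


def create_pipeline() -> Pipeline:
    # Nodes are connected by matching output names to input names.
    return Pipeline(
        [
            node(clean_data, inputs="raw_data", outputs="clean_data", name="clean"),
            node(add_features, inputs="clean_data", outputs="model_input", name="features"),
        ]
    )
```

From there, `kedro run` executes the pipeline and `kedro viz` opens the Kedro-Viz visualization.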
SkyPilot
LLMs are extremely popular these days; companies and individuals are using them to automate boring tasks and improve efficiency and responsiveness. However, LLMs are cumbersome to deploy and incur high costs.
SkyPilot is an open source tool that helps engineers run LLMs, AI models, or any batch jobs in the cloud while ensuring high availability. Its [autostop and autodown feature](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html) automatically stops or tears down resources when they are not in use, resulting in lower cloud bills.
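A typical workflow is to describe the job in a task YAML and launch it with the `sky` CLI; the sketch below is illustrative, so treat the resource spec, file names, and flags as assumptions to verify against the SkyPilot docs:

```yaml
# task.yaml (illustrative)
resources:
  accelerators: A100:1      # request one A100 from whichever cloud has capacity

workdir: .                  # sync the current directory to the cluster

setup: |
  pip install -r requirements.txt

run: |
  python train.py
```

```shell
# Launch the task and auto-stop the cluster after 10 idle minutes.
sky launch -c my-cluster task.yaml --idle-minutes-to-autostop 10

# Or manage the cluster lifecycle explicitly later.
sky autostop my-cluster -i 10
sky down my-cluster
```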
Featureform
The success of a machine learning model relies on the quality of data and, hence, the features fed to the model. However, in large organizations, members of one team may not be aware of good features developed by other teams in the organization. A feature store helps eliminate this problem by providing a central repository of features that is accessible to all the teams and individuals within an organization.
[Featureform is a virtual feature store](https://docs.featureform.com/getting-started/architecture-and-components#resource-and-feature-registry) that enables data scientists to define, manage, and serve their ML model's features. The virtual feature store sits on top of the existing data infrastructure and orchestrates it to work like a traditional feature store. Its benefits include easy collaboration, organized experimentation, easy deployment (of features), and increased reliability.
By now, it should be clear that ML projects rely on numerous open source projects to take a model from training to production. As such, it is crucial that you choose tools that integrate easily with other MLOps tools, have good documentation, and have a shallow learning curve. KitOps is one such tool. Along with all these qualities, it is easy to install, has numerous tutorials, and includes enterprise support from Jozu.
If you've found this post helpful, support us with a GitHub Star!