Data quality, governance, observability, and disaster recovery are areas where best practices are still emerging in the world of the data lakehouse. A rising trend borrows the practices software developers use to manage these concerns in their code bases. This trend is called "Data as Code". Many of the practices this trend aims to bring to the lakehouse include the following (a configuration sketch for trying them out follows the list):
- Versioning to enable isolating work on branches or marking particular reproducible states through tagging
- Commits to enable time travel and rollbacks
- The ability to use branching and merging to make atomic changes to multiple objects at the same time (in data terms, multi-table transactions)
- Capturing metadata with each commit to build auditability of who is making which changes and when
- The ability to govern who can access which data and what operations they can perform
- Automating the integration of changes (Continuous Integration) and automating publishing of those changes (Continuous Deployment) via CI/CD pipelines
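These Git-style practices become concrete once a query engine is pointed at a versioned catalog. As a minimal, hedged sketch (the property names follow the Apache Iceberg and Nessie documentation, but the endpoint, warehouse path, and versions are placeholder assumptions), a PySpark session wired to a Nessie-backed Iceberg catalog might look like this:

```python
# Minimal sketch: a PySpark session configured for an Iceberg catalog backed by
# Nessie. The URI, warehouse location, and names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("data-as-code-demo")
    # The Nessie extensions add the CREATE BRANCH / USE REFERENCE / MERGE BRANCH SQL used later.
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    # Register an Iceberg catalog named "nessie" whose table metadata is versioned in Nessie.
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")    # placeholder Nessie endpoint
    .config("spark.sql.catalog.nessie.ref", "main")                             # default branch for the session
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")  # placeholder warehouse path
    .getOrCreate()
)
```

The later sketches in this section assume a session configured roughly like this.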
Project Nessie
Several solutions are emerging that approach this problem from different layers, such as the catalog, table, and file levels. Project Nessie is an open-source project that solves these problems at the catalog level. Benefits of Nessie's particular approach include:
- Isolate ingestion across your entire catalog by branching it, allowing you to audit and inspect data before publishing without exposing it to consumers and without making a "staging" copy of the data (branches do not copy data, just as Git branches do not duplicate your code).
- Make changes to multiple tables on a branch, then merge those changes as one atomic multi-table transaction.
- If a job fails or behaves in unintended ways, instead of rolling back several tables individually, you can roll back all your tables by rolling back your catalog.
- Manage access to the catalog, limiting which branches/tables a user can access and what kinds of operations they can run on them.
- Commit logs can be used as an audit log, giving visibility into updates to your catalog.
- Nessie operations can all be done via SQL, making them more accessible to data consumers (see the sketch after this list).
- Portability of your tables, as they can be accessed by any tool with Nessie support, such as Apache Spark, Apache Flink, Dremio, Presto, and more.
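As a hedged illustration of the benefits above, here is roughly what catalog-level branching, an atomic multi-table merge, and the commit log look like through Nessie's Spark SQL extensions (the statement syntax follows the Nessie documentation but can vary by version; the branch, table, and column names are placeholders):

```python
# Hedged sketch: catalog-level versioning via Nessie's Spark SQL extensions,
# using the "nessie" catalog configured earlier. Names are placeholders.
spark.sql("CREATE BRANCH IF NOT EXISTS etl_jan_load IN nessie FROM main")  # isolate work on a branch
spark.sql("USE REFERENCE etl_jan_load IN nessie")                          # switch the session to it

# Change more than one table on the branch; consumers reading main see none of this yet.
spark.sql("INSERT INTO nessie.sales.orders VALUES (1001, 'widget', 3)")
spark.sql("UPDATE nessie.sales.inventory SET qty = qty - 3 WHERE item = 'widget'")

# Publish both table changes to consumers as one atomic, multi-table merge.
spark.sql("MERGE BRANCH etl_jan_load INTO main IN nessie")

# The commit log doubles as an audit trail of who changed what and when.
spark.sql("SHOW LOG main IN nessie").show(truncate=False)

# Rolling back every table at once means re-pointing main at an earlier commit,
# e.g. with ASSIGN BRANCH (exact syntax and the commit hash depend on your setup):
# spark.sql("ASSIGN BRANCH main TO <earlier-commit-hash> IN nessie")
```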
Project Nessie Resources
Tutorials:
- Getting Started with Project Nessie, Apache Iceberg, and Apache Spark Using Docker
- A Notebook for getting started with Project Nessie, Apache Iceberg, and Apache Spark
- What is Project Nessie & how to start a Nessie Server
- Create a Data Lakehouse with Apache Iceberg, Project Nessie and Apache Spark
- Creating a Local Environment with Nessie, Spark and Apache Iceberg
Dremio Arctic
While you can deploy your own Nessie server, you can have a cloud-managed one with extra features by using the Dremio Arctic service. Beyond the catalog-level versioning features you get from having a Nessie catalog for your tables, Dremio Arctic also provides:
- Automatic table optimization services
- Easy and intuitive UI to view commit logs, manage branches, and more
- Easy integration with the Dremio Sonar Lakehouse query engine
- Zero cost to get a catalog up and running in moments with a Dremio Cloud account
Dremio Arctic Resources
- Dremio Arctic Demo Video
- Introduction to Arctic: Managing Data as Code with Dremio Arctic – Easily ensure data quality in your data lakehouse
- Managing Data as Code with Dremio Arctic: Support Machine Learning Experimentation in Your Data Lakehouse
- Multi-Table Transactions on the Lakehouse – Enabled by Dremio Arctic
- Automatic Iceberg Table Optimization with Dremio Arctic | Apache Iceberg
- Gnarly Data Waves Episode 8 - Managing Data as Code
- EP15 - Getting Started with Dremio: Sonar | Arctic Live Demo and Customer Use Cases
- EP16 - Easy Data Lakehouse Management with Dremio Arctic’s Automatic Data Optimization
- Dremio Cloud Tutorial: How to Set Up and Use an Arctic Project
- Dremio Arctic with a Notebook
CI/CD
Essentially, you can create automated pipelines that take advantage of Nessie's branching using any tool that supports Nessie, for example:
- Orchestration Tools
- CRON Jobs
- Serverless Functions
These mechanisms can be used to send instructions to Nessie-supporting tools like Dremio and Apache Spark. For example:
- Data lands on S3, triggering a Python script that sends the appropriate SQL queries to Dremio via Arrow Flight, ODBC, or REST (sketched after this list)
- A PySpark script that runs on a schedule, sending instructions to a Spark cluster
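For instance, here is a hedged sketch of the first pattern: an object landing in S3 triggers a serverless function, which forwards SQL to Dremio over Arrow Flight using the pyarrow Flight client. The endpoint, credentials, event wiring, and the SQL statement itself are placeholder assumptions rather than a definitive integration:

```python
# Hedged sketch: an S3 "object created" event invokes this serverless handler,
# which sends SQL to Dremio over Arrow Flight. Endpoint, credentials, and the
# SQL statement are placeholders; adapt them to your deployment.
from pyarrow import flight

DREMIO_ENDPOINT = "grpc+tcp://dremio.example.internal:32010"  # placeholder Flight endpoint

def run_sql(sql: str, user: str, password: str):
    client = flight.FlightClient(DREMIO_ENDPOINT)
    # Exchange credentials for a bearer token and attach it to every call.
    token = client.authenticate_basic_token(user, password)
    options = flight.FlightCallOptions(headers=[token])
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    return client.do_get(info.endpoints[0].ticket, options).read_all()

def handler(event, context):
    # Hypothetical AWS-style S3 notification payload.
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        # In a real job this SQL would load and validate the new file on an
        # isolated branch of the catalog before merging; here it is a stand-in.
        run_sql(f"SELECT '{key}' AS landed_object", "svc_etl_user", "placeholder-password")
```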
The jobs would follow a similar pattern (sketched in code after the list):
- Create a branch
- Switch to the branch
- Make updates
- Validate updates
- If validations are successful, merge the changes
- If validations fail, generate an error with details for remediation (consumers are never exposed to inconsistent or incorrect data)
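Putting the pattern together, a minimal sketch in PySpark with Nessie's SQL extensions might look like the following (it assumes a session configured as in the earlier sketch; the branch name, tables, and validation rule are placeholders):

```python
# Minimal sketch of the branch -> update -> validate -> merge pattern.
# Assumes the "nessie" catalog and SQL extensions configured earlier;
# table names and the data-quality rule are placeholders.
import uuid

branch = f"etl_{uuid.uuid4().hex[:8]}"                 # unique branch per job run
spark.sql(f"CREATE BRANCH {branch} IN nessie FROM main")
spark.sql(f"USE REFERENCE {branch} IN nessie")         # switch to the branch

# Make updates -- nothing here is visible on main yet.
spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM new_orders_view")

# Validate updates with a simple data-quality check.
bad_rows = spark.sql(
    "SELECT count(*) AS n FROM nessie.sales.orders WHERE order_total < 0"
).first()["n"]

if bad_rows == 0:
    # Validations passed: publish the changes to consumers atomically.
    spark.sql(f"MERGE BRANCH {branch} INTO main IN nessie")
else:
    # Validations failed: main is untouched, so consumers never see bad data.
    raise ValueError(f"{bad_rows} invalid rows on branch {branch}; main was not updated")
```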