Why I Built a Developer-First Apache Polaris Starter Kit
As builders, we all know the pain of setting up a new development environment. Hours spent configuring dependencies, troubleshooting integration issues, and getting different services to play nicely together. When I started working with Apache Polaris, I faced these same challenges – and decided to do something about it.
The Challenge: Getting Started with Apache Polaris
Apache Polaris is a powerful open source Iceberg REST catalog implementation, originally contributed to the Apache Software Foundation by Snowflake. This donation to open source has made enterprise-grade data catalog capabilities accessible to the broader data community via simple REST APIs.
Setting up Polaris in a development environment can be challenging. You need:
- A robust container orchestration platform
- A working metastore (typically PostgreSQL)
- S3-compatible storage
- Various security configurations and credentials
Each of these components requires careful setup and configuration. For builders just getting started or wanting to experiment with Polaris, this overhead can be a significant barrier.
The Solution: A Complete Development Environment
This is why I created an open source starter kit that provides everything needed to get Polaris up and running in a local development environment. The project follows the true spirit of open source collaboration, building upon and integrating with other excellent open source tools in the ecosystem.
The kit automates the setup of:
- A lightweight k3s Kubernetes cluster using k3d
- LocalStack for AWS S3 emulation
- PostgreSQL metastore with proper configurations
- All necessary security credentials and configurations
A key aspect of this starter kit is its comprehensive automation using Ansible. The polaris-forge-setup directory houses Ansible playbooks that:
- Automate the entire setup process
- Verify that components are ready for use
- Handle catalog setup and configuration
- Provide cleanup capabilities for development iterations
- Enable smooth transitions to higher environments
This automation-first approach serves two purposes:
- Immediate Development: Developers can get started quickly with minimal manual intervention
- Production Readiness: The Ansible scripts serve as a template for scaling to higher environments, making it easier to adapt the setup for staging or production use cases
By keeping everything open source and focusing on community-driven development, we ensure that builders can:
- Learn from the implementation
- Customize for their specific needs
- Contribute improvements back to the community
- Build upon a foundation of trusted open source tools
What is Snowflake OpenCatalog?
Snowflake OpenCatalog is a managed, enterprise-grade service built on upstream Apache Polaris, making it easy to integrate with your existing data stack. By handling the operational complexities of running Polaris at scale, it lets teams focus on their data applications:
- Managed Infrastructure: Snowflake handles all operational aspects, including:
- Polaris server management and scaling
- Security and access control
- High availability and reliability
- Regular updates and maintenance
- Enterprise Integration: Seamless connectivity with:
- Snowflake's ecosystem of data services
- Popular query engines and tools
- Existing data governance frameworks
- Enterprise security systems
- Production-Ready Features:
- Advanced access controls and auditing
- Cross-region and cross-cloud support
- Enterprise-grade SLAs
- Professional support
From Local Development to Enterprise Scale
This starter kit provides an ideal path for builders working with Apache Polaris and considering OpenCatalog for production deployment. By working with the upstream version in this development environment, you:
- Gain hands-on experience with core concepts
- Understand the underlying architecture
- Can prototype and test implementations
- Build expertise that transfers to OpenCatalog
- Have a clear path to production scaling
When you're ready to move to production, the concepts and patterns you've learned here will help you make the most of OpenCatalog's enterprise capabilities while Snowflake handles the operational complexity.
Technical Design Decisions
Why Kubernetes with k3s and k3d?
While Docker Compose is often the go-to choice for local development environments, Apache Polaris's distributed nature benefits significantly from Kubernetes's capabilities. Here's why:
- Advanced Networking: Kubernetes provides sophisticated networking between components:
- Automatic service discovery and DNS resolution
- Internal load balancing for scalable services
- Ingress management for external access
- Network policies for traffic control
- Declarative Configuration: Using tools like Helm and Kustomize, we can:
- Maintain separate configurations for different environments
- Version control our infrastructure setup
- Apply consistent changes across deployments
- Manage complex dependencies between services
- Reliable State Management:
- StatefulSets for databases and stateful services
- PersistentVolumes for durable storage
- Backup and restore capabilities
- Data replication when needed
- Security and Configuration:
- Native secrets management
- Role-Based Access Control (RBAC)
- ConfigMaps for configuration management
- Service accounts for component authentication
- Production Readiness:
- Same tools and patterns used in production
- Easy scaling of components
- Built-in monitoring and logging
- Consistent behavior across environments
I specifically chose k3s because it's lightweight and perfect for development environments. Using k3d allows us to run k3s in Docker containers, making it even more convenient for local development. It provides a full Kubernetes experience without the resource overhead of something like minikube.
LocalStack for S3 Integration
While we could have required developers to use actual AWS S3, LocalStack provides a perfect local alternative. It emulates AWS services locally, which means:
- No cloud costs during development
- No need for AWS credentials
- Faster development cycles
- Ability to work offline
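To see what that emulation looks like in practice, here's a minimal Python sketch that talks to LocalStack's S3 API with boto3. It assumes LocalStack's default edge port 4566 and the dummy test credentials used later in this guide; the bucket name is purely illustrative.
# Minimal boto3 sketch against LocalStack's S3 emulation.
# Assumptions: LocalStack's default edge port (4566) and dummy credentials;
# the bucket name is illustrative. Inside the cluster, this kit uses a
# different endpoint (see the catalog setup section below).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="scratch-bucket")
s3.put_object(Bucket="scratch-bucket", Key="hello.txt", Body=b"hello polaris")
print([obj["Key"] for obj in s3.list_objects_v2(Bucket="scratch-bucket")["Contents"]])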
PostgreSQL as the Metastore
PostgreSQL was a natural choice for the metastore. It's:
- Well-documented and widely used
- Easy to containerize
- Highly reliable
- Supported out of the box by Polaris
Kustomize for Deployment Management
Kustomize allows us to manage Kubernetes manifests in a clean, declarative way. It makes it easy to:
- Maintain different configurations for different environments
- Override settings without modifying base configurations
- Keep configurations DRY and maintainable
Getting Started
Let me walk you through how to get up and running with this starter kit.
- Ensure you have the prerequisites installed:
# Required tools and their version checks:
# Docker (Desktop or Engine)
docker --version
# Example output: Docker version 24.0.7
# Kubernetes CLI
kubectl version --client
# Example output: Client Version: v1.28.2
# k3d (>= 5.0.0)
k3d version
# Example output: k3d version v5.6.0
# Python (>= 3.11)
python --version
# Example output: Python 3.12.1
# uv (Python packaging tool)
uv --version
# Example output: uv 0.1.12
# Task
task --version
# Example output: Task version: v3.34.1
# LocalStack (>= 3.0.0)
localstack --version
# Example output: 3.0.0
- Sign up for LocalStack
Initial Setup
Clone the repository and set up your environment:
git clone https://github.com/snowflake-labs/polaris-local-forge
cd polaris-local-forge
# Set up environment variables
export PROJECT_HOME="$PWD"
export KUBECONFIG="$PWD/.kube/config"
export K3D_CLUSTER_NAME=polaris-local-forge
export K3S_VERSION=v1.32.1-k3s1
export FEATURES_DIR="$PWD/k8s"
Python Environment Setup
# Install uv
pip install uv
# Set up Python environment
uv python pin 3.12
uv venv
source .venv/bin/activate # On Unix-like systems
uv sync
Deploy the Environment
The setup process is automated through several scripts:
# Generate required sensitive files
$PROJECT_HOME/polaris-forge-setup/prepare.yml
# Create and set up the cluster
$PROJECT_HOME/bin/setup.sh
# Wait for deployments to be ready
$PROJECT_HOME/polaris-forge-setup/cluster_checks.yml --tags namespace,postgresql,localstack
Deploy Polaris
This is where things get interesting - deploying Polaris itself. You have two options for the container images:
Option 1: Use Pre-built Images
Apache Polaris doesn't currently publish official images, but you can use our pre-built images with PostgreSQL dependencies:
docker pull ghcr.io/snowflake-labs/polaris-local-forge/apache-polaris-server-pgsql
docker pull ghcr.io/snowflake-labs/polaris-local-forge/apache-polaris-admin-tool-pgsql
Option 2: Build Images Locally
Alternatively, you can build the images from source:
# Update IMAGE_REGISTRY in Taskfile.yml, then run:
task images
If you choose to build locally, remember to update the image references in:
- k8s/polaris/deployment.yaml
- k8s/polaris/bootstrap.yaml
- k8s/polaris/purge.yaml
Deploy and Verify
Apply the Kubernetes manifests:
# Apply Polaris manifests
kubectl apply -k $PROJECT_HOME/k8s/polaris
# Verify deployments and jobs
$PROJECT_HOME/polaris-forge-setup/cluster_checks.yml --tags polaris
Setting Up Your First Catalog
Before creating your first catalog, configure your AWS environment variables:
export AWS_ENDPOINT_URL=http://localstack.localstack:14566
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
export AWS_REGION=us-east-1
# Run the catalog setup
$PROJECT_HOME/polaris-forge-setup/catalog_setup.yml
Pro Tip: You can customize the default catalog settings by modifying values in polaris-forge-setup/defaults/main.yml. This file contains configurable parameters for your catalog, principal roles, and permissions.
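If you want to confirm the catalog exists without rerunning the playbook, you can query Polaris's management REST API directly. The sketch below is a minimal example; the base URL and client credentials are placeholders, so substitute the values generated for your environment and adjust the URL to however your cluster exposes the Polaris service (for example, via kubectl port-forward).
# Sketch: list the catalogs registered in Polaris via its management API.
# POLARIS_URL and the client credentials are placeholders; Polaris listens
# on port 8181 by default, so a port-forward to the service works locally.
import requests

POLARIS_URL = "http://localhost:8181"  # assumption: adjust to your exposed service
CLIENT_ID = "<client-id>"              # placeholder: use your generated credentials
CLIENT_SECRET = "<client-secret>"

# Exchange the client credentials for a bearer token
token = requests.post(
    f"{POLARIS_URL}/api/catalog/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "PRINCIPAL_ROLE:ALL",
    },
).json()["access_token"]

# List all catalogs known to this Polaris instance
resp = requests.get(
    f"{POLARIS_URL}/api/management/v1/catalogs",
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.json())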
Play with the Catalog
Once your catalog is set up, you can explore its functionality using the provided Jupyter notebook, notebooks/verify_setup.ipynb, which walks you through:
- Creating a namespace
- Defining a table
- Inserting sample data
- Verifying data storage in LocalStack
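If you prefer a quick script to the notebook, here is a minimal PyIceberg sketch of the same flow. The catalog name, URI, and credentials are placeholder assumptions; use the values from your own setup (the defaults live in polaris-forge-setup/defaults/main.yml).
# PyIceberg sketch mirroring the notebook: create a namespace, create a
# table, insert rows, and read them back. The catalog name, URI, and
# credentials below are placeholders for the values your setup generated.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polardb",  # assumption: the catalog name created by catalog_setup.yml
    **{
        "uri": "http://localhost:8181/api/catalog",  # assumption: adjust to your service
        "credential": "<client-id>:<client-secret>",  # placeholder credentials
        "scope": "PRINCIPAL_ROLE:ALL",
        "warehouse": "polardb",
    },
)

catalog.create_namespace("demo")

# Define a table from a PyArrow schema and append a couple of rows
rows = pa.table({"id": [1, 2], "name": ["alice", "bob"]})
table = catalog.create_table("demo.people", schema=rows.schema)
table.append(rows)

print(table.scan().to_arrow().to_pydict())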
This hands-on exploration helps you understand how Polaris integrates with:
- The PostgreSQL metastore for catalog management
- LocalStack's S3 emulation for data storage
- The overall Apache Iceberg table format structure
You can visually verify your setup by checking the LocalStack console at https://app.localstack.cloud/inst/default/resources/s3/polardb, where you'll see:
- Catalog storage structure
- Metadata files
- Actual data files
Video Walkthrough
For a detailed visual guide of setting up and using this development environment, check out my walkthrough video:
This video demonstrates the entire process from initial setup to running your first queries.
Troubleshooting Tips
If you run into issues, here are some helpful commands for debugging:
# Check Polaris server logs
kubectl logs -f -n polaris deployment/polaris
# Check PostgreSQL logs
kubectl logs -f -n polaris statefulset/postgresql
# Check LocalStack logs
kubectl logs -f -n localstack deployment/localstack
# Check events in the polaris namespace
kubectl get events -n polaris --sort-by='.lastTimestamp'
The Impact: Streamlined Development Experience
With this starter kit, what used to take days of setup and configuration now takes minutes. Builders can focus on creating and experimenting with Polaris rather than wrestling with infrastructure setup.
The kit is open source and available on GitHub. I welcome contributions and feedback from the community. Together, we can make the development experience even better for everyone working with Apache Polaris.
Building should be about creating, not configuring. This starter kit aims to remove the friction from getting started with Apache Polaris, allowing builders to focus on what matters most – creating great applications.
Don't forget to check out another project where I used this starter kit: https://github.com/kameshsampath/balloon-popper-demo.
Related Projects and Tools
Core Components
- Apache Polaris - Data Catalog and Governance Platform
- PyIceberg - Python library for Apache Iceberg
- LocalStack - AWS Cloud Service Emulator
- k3d - k3s in Docker
- k3s - Lightweight Kubernetes Distribution
- Ansible - Automation Platform
Development Tools
- Docker - Container Platform
- Kubernetes - Container Orchestration
- Helm - Kubernetes Package Manager
- kubectl - Kubernetes CLI
- uv - Python Packaging Tool