tl;dr;
Migrating image data from Cloud SQL to GCS significantly improved database performance and reduced storage costs. By automating the process with Kubernetes Jobs and GitHub Actions, I ensured a repeatable and secure migration path.
Key Takeaways:
- Use Appropriate Storage Solutions: Databases are not ideal for large binary data.
- Automate with CI/CD: Reduces errors and streamlines deployments.
- Secure Authentication: Utilize Workload Identity to avoid managing service account keys.
- Cleanup After One-Time Tasks: Remove temporary scripts and resources to maintain a clean codebase.
In a recent project, I faced a challenge with efficiently storing image data. The service I was working on is built on Google Cloud Platform (GCP), utilizing Cloud SQL for relational data and Google Cloud Storage (GCS) for object storage. Initially, image data in base64 format was stored directly within the MessageTable
of Cloud SQL. However, as the volume of images grew, I noticed increased database load and potential performance degradation.
To address this, I decided to migrate the base64-encoded images from the database to GCS. This blog post outlines the technical steps I took to achieve this migration using a Kubernetes Job, TypeScript scripts, and GitHub Actions.
The problem with storing images in the database
Storing large binary data like images in a relational database is suboptimal. Databases are designed for structured data and transactions, not for handling large blobs of data. Doing so can:
- Increase backup sizes and times: Large data sizes make backups slower and consume more storage.
- Slow down query performance: Large blobs can degrade the performance of queries, especially if the images are retrieved frequently.
- Consume more memory and storage resources: Databases are not optimized for storing large files, leading to inefficient resource usage.
Learn more on this blog.
Solution Overview
To solve this, I planned to:
- Extract the base64 images from Cloud SQL.
- Upload them to GCS, which is optimized for storing large objects.
- Update the application to reference the images from GCS instead of the database.
This process involved creating a TypeScript script to handle the migration, setting up a Kubernetes Job to execute the script within a private network, and automating the deployment with GitHub Actions.
Preparing the Migration Script
I used TypeScript and the Drizzle ORM for database interactions. The script performs the following steps:
- Retrieve messages containing base64 image data from the database.
- Decode the base64 string into a binary buffer.
- Determine the image file type using the buffer.
- Upload the image to GCS under a structured path using the message ID.
- Handle errors and log them appropriately.
Migration Script (task.ts
):
const migrateDBToStorage = async () => {
const messages = await db.select().from(MessageTable).execute();
for (const message of messages) {
const base64Image = message.base64ImageData.replace(/^data:image\/\w+;base64,/, '');
const buffer = Buffer.from(base64Image, 'base64');
const fileType = await fileTypeFromBuffer(buffer);
if (!fileType) {
logger.error(`Failed to get file type for message ID: ${message.id}`);
continue;
}
const uploadResult = await storageClient.uploadBlobsInBatch([
{
path: `messages/${message.id}.${fileType.ext}`,
content: buffer,
},
]);
if (!uploadResult.isSuccess) {
logger.error(`Failed to upload image for message ID: ${message.id}`, uploadResult.getError());
}
}
};
Key Points:
- Data Extraction: Retrieves all messages with image data.
- Data Transformation: Removes the base64 header and decodes the string.
-
File Type Detection: Uses
fileTypeFromBuffer
to determine the file extension. -
Uploading to GCS: Utilizes
storageClient.uploadBlobsInBatch
for efficient uploading.
Setting Up the Kubernetes Job
To execute the script securely within the private network, I used a Kubernetes Job on Google Kubernetes Engine (GKE). I utilized kustomize to manage environment-specific configurations while keeping a common base.
Directory Structure:
.
├── base
│ ├── external-secret.yaml
│ ├── job.yaml
│ └── kustomization.yaml
└── overlays
├── dev
│ ├── config.properties
│ ├── external-secret.yaml
│ ├── job.yaml
│ ├── kustomization.yaml
│ └── serviceaccount.yaml
├── stage
└── prod
Job Definition (job.yaml
):
apiVersion: batch/v1
kind: Job
metadata:
name: db-to-storage-migration
spec:
template:
spec:
containers:
- name: db-to-storage-migration
image: <artifact-registry-image>
command:
- "sh"
- "-c"
- "bun run /app/index.js migrate-db-to-storage"
restartPolicy: Never
Explanation:
- apiVersion & kind: Defines this resource as a Kubernetes Job.
-
metadata.name: Names the job
db-to-storage-migration
. -
spec.template.spec.containers: Specifies the container to run.
- image: Points to the container image stored in Artifact Registry.
-
command: Executes the migration script using
bun
.
-
restartPolicy: Set to
Never
to avoid re-running upon completion.
Managing Service Accounts and Permissions
To access Cloud SQL and GCS from the Kubernetes pods, I set up the necessary permissions:
-
Cloud SQL: Assigned
roles/cloudsql.client
to allow connections. -
GCS: Assigned
roles/storage.objectCreator
to enable uploading objects.
I used Workload Identity to map a Kubernetes Service Account to a Google Service Account with these permissions.
Service Account Configuration (serviceaccount.yaml
):
apiVersion: v1
kind: ServiceAccount
metadata:
name: db-to-storage-migration-sa
annotations:
iam.gke.io/gcp-service-account: <gcp-service-account-email>
Steps Taken:
- Created a Google Service Account with the required IAM roles.
- Annotated the Kubernetes Service Account to link it with the Google Service Account.
- Enabled Workload Identity to allow pods to authenticate as the Google Service Account without using service account keys.
Automating Deployment with GitHub Actions
To apply the Kubernetes Job on GKE, I utilized GitHub Actions. This approach:
- Reduces Manual Errors: Automation prevents mistakes that can occur during manual deployment.
- Leverages Existing CI/CD Pipelines: Integrates smoothly with current workflows.
- Enhances Security: Uses Workload Identity and OIDC for authentication, avoiding service account keys.
GitHub Actions Workflow (db-to-storage-migration.yaml
):
name: Database to Storage Migration
on:
workflow_dispatch:
env:
WORKLOAD_IDENTITY_PROVIDER: ${{ secrets.WORKLOAD_IDENTITY_PROVIDER }}
GHA_SERVICE_ACCOUNT: ${{ secrets.GHA_SERVICE_ACCOUNT }}
REGION: <region>
GCP_PROJECT: <project-id>
GKE_CLUSTER: <cluster-name>
jobs:
db-to-storage-migration:
runs-on: ubuntu-latest
permissions:
contents: read
id-token: write
steps:
- uses: actions/checkout@v4
with:
ref: ${{ github.ref_name || 'development' }}
- uses: google-github-actions/auth@v2
with:
workload_identity_provider: ${{ env.WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ env.GHA_SERVICE_ACCOUNT }}
- uses: google-github-actions/setup-gcloud@v2
with:
project_id: ${{ env.GCP_PROJECT }}
- run: |
gcloud components install gke-gcloud-auth-plugin
gcloud container clusters get-credentials $GKE_CLUSTER --region $REGION --project $GCP_PROJECT
kubectl apply -k deployment/k8s/manifests/db-to-storage-migration/overlays/dev
Workflow Breakdown:
-
Trigger: Manually via
workflow_dispatch
. - Environment Variables: Securely injected via GitHub Secrets.
-
Jobs:
-
Authentication: Uses
google-github-actions/auth
with Workload Identity. -
Setup Cloud SDK: Ensures
gcloud
CLI is available. - Execute Commands:
- Installs the GKE auth plugin.
- Fetches cluster credentials.
- Applies the Kubernetes manifests using
kubectl apply -k
.
-
Authentication: Uses
Execution and Cleanup
After setting everything up:
- Ran the GitHub Actions workflow for each environment (dev, stage, prod).
-
Migration Script Execution:
- The Kubernetes Job pulled the container image.
- Executed the migration script inside the cluster.
- Logs were collected for monitoring and troubleshooting.
-
Post-Migration:
- Updated the application code to reference images from GCS.
- Tested thoroughly to ensure functionality.
-
Cleanup:
- Deleted the Kubernetes Job and related resources.
- Removed the migration script from the codebase to prevent unintended reuse.
If you found this guide helpful and are dealing with similar challenges, don't hesitate to reach out to me on X. For more tech insights and updates, consider following me on GitHub. Let's innovate together!