Cloud Run is an awesome serverless product from Google Cloud that is often a perfect fit for running containerized web services. It offers many advantages: autoscaling, rolling updates, automatic restarts and scale to zero, to name just a few. All of this without the hassle of provisioning and managing a cluster!
You would definitely pick this product to host, say, a Python Flask REST API with the following design (sketched right after the list):
1- Upload a data file by HTTP POST to a REST endpoint
2- Process the file
3- Insert the data into BigQuery using the client lib
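For illustration, here is a minimal sketch of that original design, assuming a CSV payload; the process_file() helper, the /upload route and the my_dataset.my_table destination are placeholders, not part of the original post:
# Minimal sketch of the original design (all names are placeholders).
import csv
import io

from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
bq_client = bigquery.Client()

def process_file(payload: bytes) -> list:
    # Hypothetical processing step: parse the uploaded bytes as CSV rows.
    reader = csv.DictReader(io.StringIO(payload.decode("utf-8")))
    return list(reader)

@app.route("/upload", methods=["POST"])
def upload():
    # 1- the data file arrives in the body of the POST request
    payload = request.get_data()
    # 2- process the file
    rows = process_file(payload)
    # 3- insert the data into BigQuery using the client library
    errors = bq_client.insert_rows_json("my_dataset.my_table", rows)
    if errors:
        return {"status": "error", "errors": errors}, 500
    return {"status": "ok"}, 201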
Which is perfectly fine... unless you want to handle data files bigger than 32 MB!
Indeed, Cloud Run won't let you upload such a big file. Instead, you'll get an error message:
413: Request entity too large
Congratulations, you've just hit the hard size limit of Cloud Run inbound requests.
But don't worry: you can keep using Cloud Run for your service if you apply the improved design below:
Improved design, with Cloud Storage, signed URLs and Pub/Sub notifications
To work around the limitation, you can design a solution based upon Cloud Storage signed URLs:
This time, the file is not uploaded directly to the REST endpoint, but to Cloud Storage instead, thus bypassing the 32 MB limitation.
The downside of this approach is that the client has to make two requests instead of one. Hence, the whole new sequence goes like this:
1- the client requests a signed URL to upload to
2- the webservice, using the Cloud Storage client, generates a signed URL and returns it to the client
3- the client uploads the file directly to the Cloud Storage bucket (HTTP PUT to the signed URL)
4- at the end of the file upload, an OBJECT_FINALIZE notification is sent to Pub/Sub
5- the notification is then pushed back to the webservice on Cloud Run through a push subscription
6- the webservice reacts to the notification by downloading the file (a minimal handler sketch follows this list)
7- the webservice can then process the file, exactly as in the original design
8- likewise, the data is inserted into BigQuery
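To make steps 5 to 7 concrete, here is a minimal sketch of the push endpoint, assuming a Flask webservice; the /notifications route name and the process_and_insert() helper are placeholders. Pub/Sub pushes a JSON envelope whose message.data field is the base64-encoded Cloud Storage notification (JSON_API_V1 payload containing the bucket and object names):
# Minimal sketch of the Pub/Sub push handler (route name and helper are placeholders).
import base64
import json

from flask import Flask, request
from google.cloud import storage

app = Flask(__name__)
storage_client = storage.Client()

def process_and_insert(content: bytes) -> None:
    # Placeholder for the same processing / BigQuery insert as the original design.
    pass

@app.route("/notifications", methods=["POST"])
def handle_notification():
    # 5- Pub/Sub POSTs a JSON envelope; message.data is the base64-encoded
    #    Cloud Storage notification (JSON_API_V1 payload)
    envelope = request.get_json()
    notification = json.loads(base64.b64decode(envelope["message"]["data"]))
    # 6- download the freshly uploaded file from the bucket
    blob = storage_client.bucket(notification["bucket"]).blob(notification["name"])
    content = blob.download_as_bytes()
    # 7- process the file and insert the data into BigQuery
    process_and_insert(content)
    # a 2xx response acknowledges the push message
    return "", 204
The URL of this route is what you would configure as the push endpoint of the Pub/Sub subscription shown in the Terraform section below.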
This design is entirely serverless and scales neatly, without any single point of failure. Now, let's see in more detail how to implement it.
Make a signed URL from Cloud Run
Gotcha! The Cloud Run service's service account must have the role roles/iam.serviceAccountTokenCreator in order to be able to generate a signed URL. This is not really documented, and if you don't grant it, you get an HTTP 403 error without much more information.
This Python code, courtesy of this blog post by Evan Peterson, shows how to produce signed URLs with the Cloud Run webservice's default service account, without requiring the private key file locally (which is a big no-no for security reasons!)
from typing import Optional
from datetime import timedelta
from google import auth
from google.auth.transport import requests
from google.cloud.storage import Client
def make_signed_upload_url(
    bucket: str,
    blob: str,
    *,
    exp: Optional[timedelta] = None,
    content_type="application/octet-stream",
    min_size=1,
    max_size=int(1e6)
):
    """
    Compute a GCS signed upload URL without needing a private key file.
    Can only be called when a service account is used as the application
    default credentials, and when that service account has the proper IAM
    roles, like `roles/storage.objectCreator` for the bucket, and
    `roles/iam.serviceAccountTokenCreator`.
    Source: https://stackoverflow.com/a/64245028

    Parameters
    ----------
    bucket : str
        Name of the GCS bucket the signed URL will reference.
    blob : str
        Name of the GCS blob (in `bucket`) the signed URL will reference.
    exp : timedelta, optional
        Time from now when the signed URL will expire.
    content_type : str, optional
        The required mime type of the data that is uploaded to the generated
        signed URL.
    min_size : int, optional
        The minimum size the uploaded file can be, in bytes (inclusive).
        If the file is smaller than this, GCS will return a 400 code on upload.
    max_size : int, optional
        The maximum size the uploaded file can be, in bytes (inclusive).
        If the file is larger than this, GCS will return a 400 code on upload.
    """
    if exp is None:
        exp = timedelta(hours=1)
    credentials, project_id = auth.default()
    if credentials.token is None:
        # Perform a refresh request to populate the access token of the
        # current credentials.
        credentials.refresh(requests.Request())
    client = Client()
    bucket = client.get_bucket(bucket)
    blob = bucket.blob(blob)
    return blob.generate_signed_url(
        version="v4",
        expiration=exp,
        service_account_email=credentials.service_account_email,
        access_token=credentials.token,
        method="PUT",
        content_type=content_type,
        headers={"X-Goog-Content-Length-Range": f"{min_size},{max_size}"}
    )
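For example, the webservice can expose a route that returns a signed URL to the client (step 2 of the sequence). This is a minimal sketch, not the original post's code: the /upload-url route and the 5 GB maximum are arbitrary choices, and the bucket name matches the one created by the Terraform below.
# Hypothetical Flask route returning a signed upload URL to the client.
from flask import Flask, request

app = Flask(__name__)

@app.route("/upload-url", methods=["GET"])
def upload_url():
    filename = request.args.get("filename", "data.bin")
    max_size = int(5e9)  # allow files up to ~5 GB
    url = make_signed_upload_url(
        bucket="upload-big-files",  # bucket created by the Terraform below
        blob=filename,
        max_size=max_size,
    )
    # return the size bounds too, so the client can set the mandatory header
    return {"url": url, "min_size": 1, "max_size": max_size}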
Terraform
There is no robust way to do cloud without infrastructure as code, and Terraform is the perfect tool to manage your cloud resources.
Here are the Terraform fragments for deploying this design:
# Resources to handle big data files (>32 MB)
# These files are uploaded to a special bucket with notifications
provider "google-beta" {
  project = <your GCP project name>
}

data "google_project" "default" {
  provider = google-beta
}

resource "google_storage_bucket" "bigframes_bucket" {
  project  = <your GCP project name>
  name     = "upload-big-files"
  location = "EU"
  cors {
    origin = ["*"]
    method = ["*"]
    response_header = [
      "Content-Type",
      "Access-Control-Allow-Origin",
      "X-Goog-Content-Length-Range"
    ]
    max_age_seconds = 3600
  }
}

resource "google_service_account" "default" {
  provider   = google-beta
  account_id = "sa-webservice"
}

resource "google_storage_bucket_iam_member" "bigframes_admin" {
  bucket = google_storage_bucket.bigframes_bucket.name
  role   = "roles/storage.admin"
  member = "serviceAccount:${google_service_account.default.email}"
}

# required to generate a signed url
resource "google_service_account_iam_member" "tokencreator" {
  provider           = google-beta
  service_account_id = google_service_account.default.name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:${google_service_account.default.email}"
}

# upload topic for notifications
resource "google_pubsub_topic" "bigframes_topic" {
  provider = google-beta
  name     = "topic-bigframes"
}

# upload deadletter topic for failed notifications
resource "google_pubsub_topic" "bigframes_topic_deadletter" {
  provider = google-beta
  name     = "topic-bigframesdeadletter"
}

# add frame upload notifications on the bucket
resource "google_storage_notification" "bigframes_notification" {
  provider       = google-beta
  bucket         = google_storage_bucket.bigframes_bucket.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.bigframes_topic.id
  event_types    = ["OBJECT_FINALIZE"]
  depends_on     = [google_pubsub_topic_iam_binding.bigframes_binding]
}

# required for storage notifications
# seriously, Google, this should be the default!
resource "google_pubsub_topic_iam_binding" "bigframes_binding" {
  topic   = google_pubsub_topic.bigframes_topic.id
  role    = "roles/pubsub.publisher"
  members = ["serviceAccount:service-${data.google_project.default.number}@gs-project-accounts.iam.gserviceaccount.com"]
}

# frame upload main sub
resource "google_pubsub_subscription" "bigframes_sub" {
  provider = google-beta
  name     = "sub-bigframes"
  topic    = google_pubsub_topic.bigframes_topic.id
  push_config {
    push_endpoint = <URL where pushed notifications are POSTed>
  }
  dead_letter_policy {
    dead_letter_topic = google_pubsub_topic.bigframes_topic_deadletter.id
  }
}

# frame upload deadletter subscription
resource "google_pubsub_subscription" "bigframes_sub_deadletter" {
  provider             = google-beta
  name                 = "sub-bigframesdeadletter"
  topic                = google_pubsub_topic.bigframes_topic_deadletter.id
  ack_deadline_seconds = 600
  push_config {
    push_endpoint = <URL where pushed notifications are POSTed>
  }
}
Just terraform apply it!
How to upload
One final gotcha: to upload to Cloud Storage with the signed URL, you must set an additional header in the PUT request:
X-Goog-Content-Length-Range: <min size>,<max size>
where <min size> and <max size> match the min_size and max_size arguments of the make_signed_upload_url() function above.
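For instance, here is a minimal client-side sketch of steps 1 and 3 with the requests library; it assumes the hypothetical /upload-url route sketched earlier and uses a placeholder Cloud Run URL and file name:
# Hypothetical client-side upload using the requests library.
import requests

SERVICE_URL = "https://my-service-xyz-ew.a.run.app"  # placeholder Cloud Run URL

# 1- ask the webservice for a signed URL
resp = requests.get(f"{SERVICE_URL}/upload-url", params={"filename": "big_file.csv"})
resp.raise_for_status()
info = resp.json()

# 3- PUT the file directly to Cloud Storage, with the mandatory size-range header;
#    the Content-Type must match the one used to sign the URL
with open("big_file.csv", "rb") as f:
    upload = requests.put(
        info["url"],
        data=f,
        headers={
            "Content-Type": "application/octet-stream",
            "X-Goog-Content-Length-Range": f"{info['min_size']},{info['max_size']}",
        },
    )
upload.raise_for_status()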
Conclusion
Have you tried this design? How would you improve it? Please let me know in the comments.
Thanks for reading! I’m Matthieu, data engineer at Stack Labs.
If you want to discover the Stack Labs Data Platform or join an enthusiastic Data Engineering team, please contact us.
Design diagrams made with Excalidraw and the GCP Icons library by @clementbosc
Cover photo by joel herzog on Unsplash