How to overcome Cloud Run's 32MB request limit

matthieucham - Apr 6 '22 - Dev Community

Cloud Run is an awesome serverless product from Google Cloud that is often a perfect fit for running containerized web services. It offers many advantages, such as autoscaling, rolling updates, automatic restarts and scale to zero, to name just a few. All of this comes without the hassle of provisioning and managing a cluster!

You would definitely pick this product to host, say, a Python Flask REST API with the following design:

1- Upload a data file by HTTP POST to a REST endpoint
2- Process the file
3- Insert the data into BigQuery using the client lib

Which is perfectly fine... unless you want to handle data files bigger than 32 MB!

Indeed, Cloud Run won't let you upload such a big file. Instead, you'll get an error message:

413: Request entity too large

Congratulations, you've just hit the hard size limit of Cloud Run inbound requests.
But don't worry, you can keep using Cloud Run for your service, if you apply the improved design below:

Improved design, with Cloud Storage, signed URLs and Pub/Sub notifications

To work around the limitation, you can design a solution based on Cloud Storage signed URLs.

This time, the file is not uploaded directly to the REST endpoint but to Cloud Storage instead, thus bypassing the 32 MB limitation.

The downside of this process is that the client has to make two requests instead of one. Hence, the whole new sequence goes like this:

1- the client requests a signed URL to upload to
2- the webservice, using the Cloud Storage client, generates a signed URL and returns it to the client
3- the client uploads the file directly to the Cloud Storage bucket (HTTP PUT to the signed URL)
4- at the end of the file upload, the OBJECT_FINALIZE notification is sent to Pub/Sub
5- the notification is then pushed back to the webservice on Cloud Run through a subscription
6- the webservice reacts to the notification by downloading the file
7- the webservice can then process the file, exactly as it did in the original design
8- likewise, the data is inserted into BigQuery
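
Steps 5 and 6 above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the standard Pub/Sub push envelope (a JSON body with a `message` object) and the `JSON_API_V1` payload format, in which the base64-encoded `data` field holds the JSON description of the uploaded object. The function name `parse_gcs_notification` is hypothetical.

```python
import base64
import json
from typing import Tuple


def parse_gcs_notification(envelope: dict) -> Tuple[str, str]:
    """Return (bucket, object name) from the body of a Pub/Sub push request.

    `envelope` is the decoded JSON body POST-ed by the push subscription.
    """
    message = envelope["message"]
    event_type = message.get("attributes", {}).get("eventType")
    if event_type != "OBJECT_FINALIZE":
        raise ValueError(f"unexpected event type: {event_type}")
    # Assumption: payload_format is JSON_API_V1, so `data` is the
    # base64-encoded JSON representation of the finalized object.
    payload = json.loads(base64.b64decode(message["data"]))
    return payload["bucket"], payload["name"]


if __name__ == "__main__":
    # Hand-built envelope, standing in for a real notification.
    fake_object = {"bucket": "upload-big-files", "name": "data/file.csv"}
    envelope = {
        "message": {
            "attributes": {"eventType": "OBJECT_FINALIZE"},
            "data": base64.b64encode(json.dumps(fake_object).encode()).decode(),
        }
    }
    print(parse_gcs_notification(envelope))
```

In the webservice, the push endpoint would call this on the request body, download the object with the Cloud Storage client, and answer with a 2xx status so that Pub/Sub considers the message acknowledged.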

This design is entirely serverless and scales neatly, without any single point of failure. Now, let's see in more detail how to implement it.

Make a signed URL from Cloud Run

Gotcha! The service account that runs the Cloud Run service must have the role roles/iam.serviceAccountTokenCreator in order to generate a signed URL. This is not really documented, and if you don't grant it, you get an HTTP 403 error without much more information.

This Python code, courtesy of this blog post by Evan Peterson, shows how to produce signed URLs with the Cloud Run webservice's default service account, without requiring the private key file locally (which is a big no-no for security reasons!).



from typing import Optional
from datetime import timedelta

from google import auth
from google.auth.transport import requests
from google.cloud.storage import Client


def make_signed_upload_url(
    bucket: str,
    blob: str,
    *,
    exp: Optional[timedelta] = None,
    content_type="application/octet-stream",
    min_size=1,
    max_size=int(1e6),  # about 1 MB by default: raise it for files over 32 MB
):
    """
    Compute a GCS signed upload URL without needing a private key file.
    Can only be called when a service account is used as the application
    default credentials, and when that service account has the proper IAM
    roles, like `roles/storage.objectCreator` for the bucket, and
    `roles/iam.serviceAccountTokenCreator`.
    Source: https://stackoverflow.com/a/64245028

    Parameters
    ----------
    bucket : str
        Name of the GCS bucket the signed URL will reference.
    blob : str
        Name of the GCS blob (in `bucket`) the signed URL will reference.
    exp : timedelta, optional
        Time from now when the signed url will expire.
    content_type : str, optional
        The required mime type of the data that is uploaded to the generated
        signed url.
    min_size : int, optional
        The minimum size the uploaded file can be, in bytes (inclusive).
        If the file is smaller than this, GCS will return a 400 code on upload.
    max_size : int, optional
        The maximum size the uploaded file can be, in bytes (inclusive).
        If the file is larger than this, GCS will return a 400 code on upload.
    """
    if exp is None:
        exp = timedelta(hours=1)
    credentials, project_id = auth.default()
    if credentials.token is None:
        # Perform a refresh request to populate the access token of the
        # current credentials.
        credentials.refresh(requests.Request())
    client = Client()
    bucket = client.get_bucket(bucket)
    blob = bucket.blob(blob)
    return blob.generate_signed_url(
        version="v4",
        expiration=exp,
        service_account_email=credentials.service_account_email,
        access_token=credentials.token,
        method="PUT",
        content_type=content_type,
        headers={"X-Goog-Content-Length-Range": f"{min_size},{max_size}"}
    )


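
To show how the webservice could hand a signed URL back to the client (steps 1 and 2 of the sequence), here is a small framework-agnostic sketch. The helper name `build_upload_ticket`, the returned dictionary layout and the 15-minute expiry are assumptions made for illustration. The `sign` argument is expected to be a function with the same keyword arguments as `make_signed_upload_url()` above. Note that the default `max_size` of that function is about 1 MB, so a larger value has to be passed explicitly for big files.

```python
import uuid
from datetime import timedelta
from typing import Callable


def build_upload_ticket(
    sign: Callable[..., str],
    bucket: str,
    filename: str,
    *,
    max_size: int = 500 * 1024 * 1024,  # hypothetical cap of 500 MiB
) -> dict:
    """Build the payload the webservice returns to a client asking to upload.

    `sign` is assumed to behave like make_signed_upload_url().
    """
    # Prefix the object name with a random id so that two clients uploading
    # a file with the same name do not overwrite each other.
    blob = f"{uuid.uuid4().hex}/{filename}"
    content_type = "application/octet-stream"
    url = sign(
        bucket,
        blob,
        exp=timedelta(minutes=15),
        content_type=content_type,
        min_size=1,
        max_size=max_size,
    )
    return {
        "url": url,
        "blob": blob,
        # Headers the client is expected to repeat on its PUT request.
        "headers": {
            "Content-Type": content_type,
            "X-Goog-Content-Length-Range": f"1,{max_size}",
        },
    }
```

A Flask route would typically call `build_upload_ticket(make_signed_upload_url, "upload-big-files", filename)` and return the result as JSON. Sending the expected headers along with the URL spares the client from guessing the size range that was signed.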

Terraform

There is no robust way to run cloud infrastructure without infrastructure as code, and Terraform is a great tool to manage your cloud resources.


Here are the Terraform fragments for deploying this design:



# Resources to handle big data files (>32 MB)
# These files are uploaded to a special bucket with notifications

provider "google-beta" {
  project = <your GCP project name>
}

data "google_project" "default" {
  provider = google-beta
}

resource "google_storage_bucket" "bigframes_bucket" {
  project  = <your GCP project name>
  name     = "upload-big-files"
  location = "EU"

  cors {
    origin = ["*"]
    method = ["*"]
    response_header = [
      "Content-Type",
      "Access-Control-Allow-Origin",
      "X-Goog-Content-Length-Range"
    ]
    max_age_seconds = 3600
  }
}

resource "google_service_account" "default" {
  provider     = google-beta
  account_id   = "sa-webservice"
}

resource "google_storage_bucket_iam_member" "bigframes_admin" {
  bucket = google_storage_bucket.bigframes_bucket.name
  role   = "roles/storage.admin"
  member = "serviceAccount:${google_service_account.default.email}"
}

# required to generate a signed url
resource "google_service_account_iam_member" "tokencreator" {
  provider           = google-beta
  service_account_id = google_service_account.default.name
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "serviceAccount:${google_service_account.default.email}"
}

# upload topic for notifications
resource "google_pubsub_topic" "bigframes_topic" {
  provider = google-beta
  name     = "topic-bigframes"
}

# upload deadletter topic for failed notifications
resource "google_pubsub_topic" "bigframes_topic_deadletter" {
  provider = google-beta
  name     = "topic-bigframesdeadletter"
}

# add frame upload notifications on the bucket
resource "google_storage_notification" "bigframes_notification" {
  provider       = google-beta
  bucket         = google_storage_bucket.bigframes_bucket.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.bigframes_topic.id
  event_types    = ["OBJECT_FINALIZE"]
  depends_on = [google_pubsub_topic_iam_binding.bigframes_binding]
}

# required for storage notifications
# seriously, Google, this should be by default !
resource "google_pubsub_topic_iam_binding" "bigframes_binding" {
  topic   = google_pubsub_topic.bigframes_topic.id
  role    = "roles/pubsub.publisher"
  members = ["serviceAccount:service-${data.google_project.default.number}@gs-project-accounts.iam.gserviceaccount.com"]
}

# frame upload main sub
resource "google_pubsub_subscription" "bigframes_sub" {
  provider = google-beta
  name     = "sub-bigframes"
  topic    = google_pubsub_topic.bigframes_topic.id

  push_config {
    push_endpoint = <URL where pushed notification are POST-ed>
  }
  dead_letter_policy {
    dead_letter_topic = google_pubsub_topic.bigframes_topic_deadletter.id
  }
}

# frame upload deadletter subscription
resource "google_pubsub_subscription" "bigframes_sub_deadletter" {
  provider             = google-beta
  name                 = "sub-bigframesdeadletter"
  topic                = google_pubsub_topic.bigframes_topic_deadletter.id
  ack_deadline_seconds = 600

  push_config {
    push_endpoint = <URL where pushed notification are POST-ed>
  }
}




Just run terraform apply!

How to upload

One final gotcha: to upload to Cloud Storage with the signed URL, you must set an additional header in the PUT request:
X-Goog-Content-Length-Range: <min size>,<max size>
where min size and max size match the min_size and max_size arguments passed to the make_signed_upload_url() function above.
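
As an illustration, a client-side upload could be prepared like this with the Python standard library. This is only a sketch: the helper name `build_upload_request` is hypothetical, and it assumes the URL was signed with the same content type and size range as the ones passed here.

```python
import urllib.request


def build_upload_request(
    signed_url: str,
    data: bytes,
    *,
    content_type: str = "application/octet-stream",
    min_size: int = 1,
    max_size: int = int(1e6),
) -> urllib.request.Request:
    """Prepare the HTTP PUT request that uploads `data` to a signed URL."""
    return urllib.request.Request(
        signed_url,
        data=data,
        method="PUT",
        headers={
            # Assumed to match the content type used when signing the URL.
            "Content-Type": content_type,
            # Must match the min_size and max_size used when signing.
            "X-Goog-Content-Length-Range": f"{min_size},{max_size}",
        },
    )


# Usage (not executed here, since it needs a real signed URL):
#   with open("big_file.csv", "rb") as f:
#       req = build_upload_request(url, f.read(), max_size=500 * 1024 * 1024)
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status)
```

This version reads the whole file into memory, which keeps the example short. For very large files, a streaming upload would be a better fit.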

Conclusion

Have you tried this design? How would you improve it? Please let me know in the comments.

Thanks for reading! I’m Matthieu, data engineer at Stack Labs.
If you want to discover the Stack Labs Data Platform or join an enthusiastic Data Engineering team, please contact us.


Design schemas made with Excalidraw and the GCP Icons library by @clementbosc
Cover photo by joel herzog on Unsplash
