AWS Advanced: The Quota Monitor Review

Warren Parad - Jan 9

$78,641.25

Per Month.

That's the predicted cost of running the official Quota Monitor released by AWS across ~1000 AWS accounts in your organization. For us at Authress, and for those who have an AWS account per customer or spin up extra ones per team/product/service, this cost could be significantly higher.

Obviously this isn't an entirely fair interpretation of the cost analytics. But it is pretty ridiculous that the cost here scales like this. We would fully expect a near-zero cost for running a Quota Monitor.

That is, when we need additional resources, AWS should provide a free way for us to know that we want to pay AWS more. Running into production problems by default is always a pit of failure. I call it a pit of failure because, by default, we won't know that we are about to run out of a critical resource we need to make our product work. The right solution is for AWS to alert the contacts on file for the account when there is a problem, let us burst above the limit, and then continue the conversation. (At least that is what we do for our customers.)

But AWS doesn't have that, so let's attempt the next best thing: building a better version.

The baseline

The original proposed architecture looks something like this:

AWS Quota Monitor

So let's dive into that. It looks a bit complicated. If you have been looking at architectural diagrams for as long as I have, then you know to trust this instinct. It is indeed far too complicated. Most of the $78,641.25 calculated above comes from the use of CloudWatch metrics. Metrics should be avoided at all costs since they have exorbitant pricing. This is something I previously reviewed in depth in our quest for the perfect serverless metrics solution.

The core requirements of what we need are:

  1. Call the Service Quotas API and check how close usage is to each quota.
  2. Alert when usage gets too close to a quota.
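Those two requirements boil down to a comparison and a decision. Here is a minimal sketch of that core check, assuming we already have a quota's limit and its current usage in hand. The `QuotaCheck` type, the `needs_alert` helper, the 80% threshold, and the usage value of 8500 are all made up for illustration; where the usage number comes from is a separate problem.

```python
from dataclasses import dataclass


@dataclass
class QuotaCheck:
    """Hypothetical container for one quota and its current usage."""
    quota_name: str
    limit: float
    usage: float

    @property
    def utilization(self) -> float:
        # Fraction of the quota currently consumed (0.0 when the limit is 0).
        return self.usage / self.limit if self.limit > 0 else 0.0


def needs_alert(check: QuotaCheck, threshold: float = 0.8) -> bool:
    """Return True once usage reaches the (assumed) alert threshold."""
    return check.utilization >= threshold


# Illustrative numbers only: 8500 records against a 10000 record limit.
check = QuotaCheck("Records per hosted zone", limit=10000.0, usage=8500.0)
print(check.quota_name, check.utilization, needs_alert(check))
```

Keeping the decision in a tiny pure function means each spoke account could pick its own threshold without touching any shared infrastructure.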

One of the problems with the model proposed by the AWS Quota Monitor is that it is done in isolation. That is, it doesn't rely on other alerting or monitoring infrastructure you might already be utilizing. For explaining new concepts, this is helpful. But for actually building up your AWS account into the perfectly secure and well-lubricated machine, let's utilize our existing infrastructure. Having followed the logging infrastructure already put in place, we have a strong strategy for cross-account and cross-region alerting:

Cross Account and Region Alerting

That means all we need to do is deploy some piece of technology which will call the Service Quotas API, log a message to CloudWatch Logs, and have a CloudWatch subscription set up to forward those messages to the appropriate location. Simple, right?
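As a rough sketch of the "log a message" step: if the checker runs in Lambda, anything printed to stdout ends up in CloudWatch Logs, so a single structured JSON line is enough for a subscription to pick up. The field names and the `QuotaThresholdExceeded` type below are assumptions made for this example, not an existing schema.

```python
import json
from datetime import datetime, timezone


def quota_log_line(service_code, quota_code, quota_name, limit, usage, now=None):
    """Build one JSON log line describing a quota that is close to its limit."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "level": "WARN",
        # Hypothetical marker that a log subscription could match on.
        "type": "QuotaThresholdExceeded",
        "serviceCode": service_code,
        "quotaCode": quota_code,
        "quotaName": quota_name,
        "limit": limit,
        "usage": usage,
        "utilization": round(usage / limit, 4) if limit else None,
    }
    return json.dumps(record, sort_keys=True)


# In Lambda, printing is all it takes to get the line into CloudWatch Logs.
print(quota_log_line("route53", "L-E209CC9F", "Records per hosted zone", 10000.0, 8500.0))
```

A subscription filter with a JSON pattern along the lines of `{ $.type = "QuotaThresholdExceeded" }` could then forward only these lines to wherever your alerts already go.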

Step Zero: Architecture

Rather than using the original architecture, which requires passing all the usage data from the spoke AWS accounts to the Hub, we will have the spoke AWS accounts in the organization self-report. Applications in the spoke AWS accounts already self-report whenever there is a production problem, and Quota Management isn't something that has security implications or requires oversight from our AWS Monitoring or AWS Logging Org-level accounts.

By pushing the responsibility of alerting into the spoke AWS accounts, we can control the levels at which to alert, make decisions in real time about potential issues with the current quotas, and utilize the shared infrastructure AWS accounts in our Org only when necessary.

This also has the huge benefit of self-cleanup. When an AWS account is deactivated and finally deleted, there is no historical data that is left anywhere else, no existing cross account connections, no unnecessary configuration. Every account cleans up its own configuration and data automatically. The historical information is saved, but if the account isn't being used anymore we don't need to care about whether we are hitting a quota in it.

A few details of importance:

  • Usually we would use AWS Trusted Advisor to monitor quota limits, but not everyone can benefit from this service as it requires an AWS Business Support plan. If you spend ~$100k per month, it will cost about $6,900 per month. Or let's call it 7% of your spend. That's a lot.
  • Not all the checks are there, and sometimes we aren't interested in exactly the values that have been set but want to use our own.
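For the curious, here is where the ~$6,900 figure can come from. This is a sketch that assumes the tiered Business Support pricing as I understand it was published at the time (10% of the first $10k of monthly spend, 7% up to $80k, 5% up to $250k, 3% beyond that, with a $100 minimum). Treat the tiers as an assumption and check the current AWS Support pricing page before relying on them.

```python
# Assumed tiers: (upper bound of monthly spend in USD, rate in percent).
BUSINESS_SUPPORT_TIERS = [
    (10_000, 10),
    (80_000, 7),
    (250_000, 5),
    (float("inf"), 3),
]
MINIMUM_FEE = 100.0  # Assumed monthly minimum in USD.


def business_support_cost(monthly_spend):
    """Estimate the monthly support fee by charging each slice at its tier rate."""
    cost = 0.0
    lower = 0.0
    for upper, rate_percent in BUSINESS_SUPPORT_TIERS:
        if monthly_spend > lower:
            # Only the part of the spend inside this tier is charged at its rate.
            cost += (min(monthly_spend, upper) - lower) * rate_percent / 100
        lower = upper
    return max(cost, MINIMUM_FEE)


spend = 100_000
fee = business_support_cost(spend)
print(fee, f"{fee / spend:.1%}")
```

Under those assumed tiers, $100k of spend breaks down as $1,000 + $4,900 + $1,000, which matches the $6,900 above.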

The major problem here is that the AWS Solutions' Quota Monitor only utilizes data from CloudWatch Metrics, which means only quotas that report usage to CloudWatch can be monitored. For example, I'm looking at one of our AWS Accounts, and there are only 194 metrics in the region.

Many quotas that are interesting don't have a metric. Take Route 53. Wouldn't it be great if you knew how close you were to the limit for creating new Route 53 records?

{
    "ServiceCode": "route53",
    "ServiceName": "Amazon Route 53",
    "QuotaArn": "arn:aws:servicequotas:::route53/L-E209CC9F",
    "QuotaCode": "L-E209CC9F",
    "QuotaName": "Records per hosted zone",
    "Value": 10000.0,
    "Unit": "None",
    "Adjustable": true,
    "GlobalQuota": true
}

But the problem is that Route 53 doesn't log a metric for record count.

aws service-quotas list-aws-default-service-quotas \
  --service-code route53 --region us-east-1 \
  --query 'Quotas[?UsageMetric.MetricNamespace==`AWS/Usage`]'

Running this CLI command tells you which quotas in the region are being tracked. The result from this command is only:

[
    {
        "ServiceCode": "route53",
        "ServiceName": "Amazon Route 53",
        "QuotaArn": "arn:aws:servicequotas:::route53/L-F767CB15",
        "QuotaCode": "L-F767CB15",
        "QuotaName": "Domain count limit",
        "Value": 20.0,
        "Unit": "None",
        "Adjustable": true,
        "GlobalQuota": true,
        "UsageMetric": {
            "MetricNamespace": "AWS/Usage",
            "MetricName": "ResourceCount",
            "MetricDimensions": {
                "Class": "None",
                "Resource": "DomainCount",
                "Service": "Route 53 Domains",
                "Type": "Resource"
            },
            "MetricStatisticRecommendation": "Maximum"
        },
        "Period": {
            "PeriodValue": 5,
            "PeriodUnit": "MINUTE"
        }
    }
]
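If you would rather do this filtering in code than with `--query`, the same check is a one-liner over the `Quotas` list. The sketch below runs against a trimmed copy of the two Route 53 entries shown above instead of calling AWS; `quotas_with_usage_metric` is a name made up for this example.

```python
import json


def quotas_with_usage_metric(quotas, namespace="AWS/Usage"):
    """Keep only the quotas whose UsageMetric lives in the given namespace."""
    return [
        q for q in quotas
        if (q.get("UsageMetric") or {}).get("MetricNamespace") == namespace
    ]


# Trimmed copies of the two Route 53 quotas from the output above.
SAMPLE_QUOTAS = json.loads("""
[
    {"QuotaCode": "L-E209CC9F", "QuotaName": "Records per hosted zone", "Value": 10000.0},
    {"QuotaCode": "L-F767CB15", "QuotaName": "Domain count limit", "Value": 20.0,
     "UsageMetric": {"MetricNamespace": "AWS/Usage", "MetricName": "ResourceCount"}}
]
""")

for quota in quotas_with_usage_metric(SAMPLE_QUOTAS):
    print(quota["QuotaCode"], quota["QuotaName"])
```

"Records per hosted zone" drops out because it has no `UsageMetric` at all, which is exactly the gap we care about.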

We recently found we had to request increased RPS for our API Gateways for Authress. So let's check to see what the Quota Monitor would have found:

aws service-quotas list-aws-default-service-quotas \
  --service-code apigateway --region eu-west-1 \
  --query 'Quotas[?UsageMetric.MetricNamespace==`AWS/Usage`]'

The result is empty:

[]

Well, RIP 🪦. The Quota Monitor appears to be completely useless. Even if we attempt to utilize different metric namespaces ourselves, the only Lambda-related one is:

[
    {
        "ServiceCode": "lambda",
        "ServiceName": "AWS Lambda",
        "QuotaArn": "arn:aws:servicequotas:eu-west-1::lambda/L-B99A9384",
        "QuotaCode": "L-B99A9384",
        "QuotaName": "Concurrent executions",
        "Value": 1000.0,
        "Unit": "None",
        "Adjustable": true,
        "GlobalQuota": false,
        "UsageMetric": {
            "MetricNamespace": "AWS/Lambda",
            "MetricName": "ConcurrentExecutions",
            "MetricDimensions": {},
            "MetricStatisticRecommendation": "Maximum"
        }
    }
]

Well, that's at least a bit better, but we are still missing the other 19 quotas for Lambda. Admittedly many of these are not adjustable, so let's limit our query to just the ones that are:

aws service-quotas list-aws-default-service-quotas \
  --service-code lambda --region eu-west-1 \
  --query 'Quotas[?Adjustable==`true`].QuotaName'

[
    "Concurrent executions",
    "Function and layer storage",
    "Elastic network interfaces per VPC"
]

So three, and only two more than the one we found with a usage metric.
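To make that gap explicit, here is a small sketch that splits the adjustable quotas into the ones that come with a usage metric and the ones that don't. The sample data is trimmed down to the three Lambda quota names from the output above, with the one usage metric we found; `split_by_usage_metric` is a hypothetical helper.

```python
def split_by_usage_metric(quotas):
    """Return (tracked, untracked) quota names among the adjustable quotas."""
    adjustable = [q for q in quotas if q.get("Adjustable")]
    tracked = [q["QuotaName"] for q in adjustable if q.get("UsageMetric")]
    untracked = [q["QuotaName"] for q in adjustable if not q.get("UsageMetric")]
    return tracked, untracked


# Trimmed sample based on the Lambda results shown above.
LAMBDA_QUOTAS = [
    {"QuotaName": "Concurrent executions", "Adjustable": True,
     "UsageMetric": {"MetricNamespace": "AWS/Lambda", "MetricName": "ConcurrentExecutions"}},
    {"QuotaName": "Function and layer storage", "Adjustable": True},
    {"QuotaName": "Elastic network interfaces per VPC", "Adjustable": True},
]

tracked, untracked = split_by_usage_metric(LAMBDA_QUOTAS)
print("tracked:", tracked)
print("untracked:", untracked)
```

Anything that lands in the untracked list is a quota we can raise but cannot watch through the Quota Monitor.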

So right now we are in a pretty bad place. The Quota Monitor is dependent on data that will never exist in most situations.

The disappointment

Okay, sorry for the lack of any real solution. While this started out as a journey to define a quota monitor that anyone can use, based on how the original monitor was built, we know that there isn't an easy out-of-the-box solution. That's because the Quota Monitor AWS Solution is broken by design. It only reports on things that are being tracked, and AWS only tracks usage when it is easy. When it is hard, say, Max(RPS) for API Gateway in the last 5 minutes, there is no unified usage data in Service Quotas. And without that data being tracked there is no way to alert on it. If we want to know whether we are going to hit a limit with one of the services we actually use, we need to explicitly build a solution that tracks that specific thing.
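To give a flavor of what "tracking that specific thing" could look like for the API Gateway example, here is a minimal sketch. It assumes we already have per-minute request counts for the last five minutes, however we collect them, and that we know our RPS limit. The function names, the counts, and the 10,000 RPS limit are illustrative assumptions only.

```python
def peak_average_rps(request_counts, period_seconds=60):
    """Average requests per second during the busiest period in the window."""
    if not request_counts:
        return 0.0
    return max(request_counts) / period_seconds


def rps_utilization(request_counts, rps_limit, period_seconds=60):
    """Fraction of the RPS limit consumed during the busiest period."""
    return peak_average_rps(request_counts, period_seconds) / rps_limit


# Hypothetical per-minute request counts for the last five minutes.
counts = [120_000, 300_000, 510_000, 240_000, 90_000]
print(peak_average_rps(counts), rps_utilization(counts, rps_limit=10_000))
```

Note that this averages over each minute, so short bursts inside a minute are smoothed out and the real peak can be higher. That is one more reason each quota ends up needing its own dedicated approach.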

I'm currently working through setting up real monitors for our infrastructure, and for each one we'll have a dedicated post walking through how it's done. So stay tuned!


Curious about this and want to chat about the things I've built? Message me in the community:

Join the community
