These days, cloud efficiency and cost savings are top of mind for many organizations. Current economic conditions aside, it’s always a good time to support efforts to use the cloud efficiently. Besides saving money, the same levers that drive efficiency very often directly support scalability, reliability and sustainability.
Data warehouses were historically statically provisioned, fixed-cost systems. The good news is that advancements in cloud-native data warehouse platforms have enabled us to optimize efficiency and cost much as we do for our other engineered systems.
There are obviously a lot of different ways that we can approach efficiency. Optimizing a system like Redshift should include both the infrastructure configuration and what “runs on” the platform. The latter may include various improvements such as optimizing queries, table structures and materialization patterns, or maybe even moving some of the workloads outside the data warehouse platform (e.g. moving big crunches to Amazon EMR, formerly Elastic MapReduce). For this article I’ll focus on the infrastructure layer, using only insights from Cost Explorer and assuming that what runs on Redshift is fixed.
As many of you know, Redshift RA3 instance types decouple compute and storage. Furthermore, Redshift has built elasticity into compute, allowing you to handle spikes and increases in workload by leveraging features such as Concurrency Scaling and Serverless endpoints. These features, along with the ability to blend provisioned, serverless and elastic resources, are why Redshift delivers such excellent price-performance.
Cost Explorer to the rescue!
As the old saying goes “you can't improve what you don't measure”. So the first step in figuring out whether your Redshift infrastructure is optimal is reviewing Cost Explorer. This is not necessarily a trivial task as there are many components to the actual billing, with different timing, and unfortunately some cryptic abbreviations.
Within Cost Explorer, change the report parameters to Granularity: Monthly (this is a good place to start), Dimension: Usage type, and set a Filter on Service: Redshift. You will then end up with a report that looks something like the one below. This example infrastructure configuration is really useful, as it demonstrates nearly all the components you may see with a standard RA3 and serverless deployment. I’ll now walk through the line items and identify where there may be some efficiency opportunities.
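The same report can also be pulled programmatically. Below is a minimal sketch, assuming boto3 and the Cost Explorer `GetCostAndUsage` API. The function names, the `UnblendedCost` metric and the service value `Amazon Redshift` are my assumptions for illustration; confirm the service value against what the Service filter shows in your own account.

```python
from datetime import date


def build_redshift_cost_query(start: date, end: date) -> dict:
    """Build GetCostAndUsage parameters that mirror the console report:
    monthly granularity, grouped by usage type, filtered to Redshift."""
    return {
        # End is exclusive; start on the 1st of a month.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],  # assumed metric
        "Filter": {
            # Assumed service name; verify it in your account.
            "Dimensions": {"Key": "SERVICE", "Values": ["Amazon Redshift"]}
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


def fetch_redshift_costs(ce_client, start: date, end: date) -> list:
    """Run the query with an injected boto3 Cost Explorer ("ce") client
    and return the per-month result groups."""
    response = ce_client.get_cost_and_usage(**build_redshift_cost_query(start, end))
    return response["ResultsByTime"]
```

With credentials configured, a call might look like `fetch_redshift_costs(boto3.client("ce"), date(2023, 1, 1), date(2023, 4, 1))`. Passing the client in keeps the query builder easy to inspect without touching AWS.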
Let’s get familiar with the abbreviations and their meanings. All the pricing details can be found here, but I’ll try to summarize the important bits.
- USE-ServerlessUsage - The cost of the used Redshift Processing Units (this includes whatever base capacity you have reserved plus the associated actual billing above that capacity)
- HeavyUsage - Redshift reserved instance cost (be sure to select the 1st of the month in your date selection to pick up this line item)
- Node - Redshift on-demand usage
- RMS - Redshift managed storage, storage cost in GB hours
- CS - Concurrency Scaling; you accrue up to one hour of free Concurrency Scaling credit per day, and usage beyond this is billed per second at the on-demand rate
- PaidSnapshots - Backups, necessary of course but definitely not free
- USE-DataScanned - Redshift spectrum usage, querying data that exists in S3 or other external sources
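To make a report easier to scan, the abbreviations above can be mapped to plain labels. This is a rough sketch that matches on substrings of the usage type; the label wording, the function names and the substring matching itself are my own assumptions, and real usage types may carry region prefixes or instance suffixes that differ from the examples in this article.

```python
# Ordered (marker, label) pairs based on the abbreviations listed above.
# "CS:" includes the colon so it cannot match inside another name.
USAGE_TYPE_LABELS = [
    ("ServerlessUsage", "Serverless RPU usage"),
    ("HeavyUsage", "Reserved instances"),
    ("PaidSnapshots", "Snapshots"),
    ("DataScanned", "Spectrum data scanned"),
    ("RMS", "Managed storage"),
    ("CS:", "Concurrency Scaling"),
    ("Node", "On-demand nodes"),
]


def label_usage_type(usage_type: str) -> str:
    """Return a readable label for a Cost Explorer usage type string."""
    for marker, label in USAGE_TYPE_LABELS:
        if marker in usage_type:
            return label
    return "Other"


def summarize(costs: dict) -> dict:
    """Sum costs (usage type -> amount) per readable label."""
    totals = {}
    for usage_type, amount in costs.items():
        label = label_usage_type(usage_type)
        totals[label] = totals.get(label, 0.0) + amount
    return totals
```

For example, `label_usage_type("CS:ra3.4xlarge")` returns `"Concurrency Scaling"`, and anything unrecognized falls into `"Other"` rather than being silently dropped.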
The Findings
For context we are looking at 3 pieces of Redshift infrastructure: a reserved ra3.4xlarge cluster, an on-demand ra3.xlplus cluster, and a Redshift Serverless endpoint.
Excessive Concurrency Scaling
Our Concurrency Scaling (CS:ra3.4xlarge) cost is approaching the cost of our reserved cluster. This cluster is obviously blowing through its free daily allowance and relying heavily on on-demand capacity to complete its computing tasks. First off, let’s acknowledge how cool this is - we are running a cluster at or above the redline and still completing the necessary work and serving our end users. A cost-effective solution here is to offset the on-demand pricing with reserved instances. A good experiment would be to add an on-demand node to the cluster and observe the reduction in Concurrency Scaling and overall cluster CPU. If the calculated price profile is favorable, consider reserving nodes.
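The comparison behind that experiment can be roughed out with a little arithmetic. The sketch below is illustrative only: it treats the free hour as a flat daily allowance (ignoring any carry-over of unused credits), and all rates passed in are hypothetical placeholders rather than actual Redshift prices.

```python
FREE_SECONDS_PER_DAY = 3600  # one free hour per day, per the CS line item


def billable_cs_hours(daily_cs_seconds: list) -> float:
    """Hours of Concurrency Scaling billed after the daily free hour."""
    billable = sum(max(0, s - FREE_SECONDS_PER_DAY) for s in daily_cs_seconds)
    return billable / 3600


def cs_cost(daily_cs_seconds: list, cluster_hourly_rate: float) -> float:
    """Estimated Concurrency Scaling charge at a given on-demand hourly rate."""
    return billable_cs_hours(daily_cs_seconds) * cluster_hourly_rate


def extra_node_cost(node_hourly_rate: float, days: int) -> float:
    """Cost of running one additional on-demand node around the clock."""
    return node_hourly_rate * 24 * days
```

Run `cs_cost` on the daily usage before and after adding the node; if the drop in Concurrency Scaling charges exceeds `extra_node_cost` for the same period, the experiment is favorable and reserving the node is worth pricing out.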
Redshift Serverless Usage
The Redshift Serverless endpoint (USE1-ServerlessUsage) costs have eclipsed the costs of our reserved cluster. There may be good reasons for this, and it’s possible that this is the most cost-effective way of handling this workload. The first thing to do is to check the Base RPU setting for this endpoint, making sure that we haven’t set a baseline that is too high for its actual usage and minimum requirements. Depending on the RPUs actually consumed and the usage patterns, a reserved provisioned cluster might be more cost effective. A good common-sense rule, at this point in time, is that a highly utilized cluster will be more cost effective staying provisioned. Note that the math here is a little tricky and there is no official guidance just yet, but let’s use 60% average cluster CPU as a reasonable threshold for staying provisioned. I suggest reading this section of the documentation on monitoring cost and usage.
On Demand Nodes
The ra3.xlplus cluster is contributing a fair amount of billing given its performance capacity. It’s easy to fall into the trap of running on-demand too long, especially when there are pending changes to the environment. Another common-sense rule: if you plan on running this infrastructure for more than 6 months you probably want a 1-year reservation, and for more than 18 months a 3-year one. We humans are pretty poor planners in general, especially when making guesses under uncertainty. In my experience the rules above have always been net favorable. Even if the infrastructure runs only as long as you guessed, you’ll be pretty close to the break-even point of the reservation.
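The break-even reasoning can be made concrete with a small calculation. This is a sketch under a simplifying assumption: a reservation costs a fixed fraction of the equivalent on-demand spend over its term. The discount values you pass in are hypothetical and should come from the current pricing page; the function names are mine.

```python
def break_even_months(term_months: int, discount: float) -> float:
    """Months of on-demand spend that equal the total cost of a reservation.

    discount is the fractional saving versus on-demand over the full term,
    e.g. 0.25 for a hypothetical 25% discount.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return term_months * (1 - discount)


def suggested_term_months(planned_months: float) -> int:
    """Apply the rule of thumb above: 0 (stay on-demand), 12, or 36."""
    if planned_months > 18:
        return 36
    if planned_months > 6:
        return 12
    return 0
```

With a hypothetical 25% discount, `break_even_months(12, 0.25)` gives 9.0, so a workload that runs nine months or longer comes out ahead on a 1-year reservation, and one that stops at six months pays for roughly three extra months.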
Snapshots
The snapshot (Redshift:PaidSnapshots) costs are a little high, but reasonable. However, they are still worth investigating. Start by reviewing your retention policy, and make sure that it complies with your organization's service levels and policies. Also be sure to page back to the early history and make sure you're not permanently safekeeping a large number of final snapshots. I’ve seen large buildups of these from misconfigured CI/CD pipelines or programmatic restores to non-prod.
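Paging back through snapshot history by hand gets tedious, so a small filter can help. The sketch below assumes snapshot records shaped like the entries returned by boto3's Redshift `describe_cluster_snapshots` call (keys `SnapshotIdentifier`, `SnapshotType`, `SnapshotCreateTime`, `ManualSnapshotRetentionPeriod`, where `-1` means retained indefinitely). The function name and the age threshold are hypothetical, and the field names should be checked against your own API responses.

```python
from datetime import datetime, timedelta, timezone


def stale_manual_snapshots(snapshots: list, max_age_days: int, now=None) -> list:
    """Return identifiers of manual snapshots that are kept indefinitely
    and are older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        s["SnapshotIdentifier"]
        for s in snapshots
        if s.get("SnapshotType") == "manual"
        # A missing retention period is treated as indefinite (-1).
        and s.get("ManualSnapshotRetentionPeriod", -1) == -1
        and s["SnapshotCreateTime"] < cutoff
    ]
```

The result is only a list of candidates to review; whether any of them can be deleted still depends on your retention policy.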
In Closing
Optimizing infrastructure can be both a fun and rewarding exercise. It can be like playing Sherlock Holmes with Cost Explorer as your Watson. Based solely on these Cost Explorer findings, there is a relatively easy 10 to 20% cost saving available through infrastructure configuration. Although this post outlines a high-level infrastructure review, there is definitely a lot more digging to do, especially with what “runs on” Redshift.
Go forth and be frugal.