TL;DR: Sorry for the clickbait title. Never guess from the name of a performance metric. The workload was inserting 100,000 rows per second, in bulk, at 100 transactions per second, and no database does a disk write I/O for each row for bulk inserts.
140k Write IOPS at less than 1 ms latency looks good for a database. However, it's important to be cautious. What exactly do Amazon Aurora and CloudWatch mean by "Write IOPS" in this context? There's a well-known saying attributed to Phil Karlton that is very relevant to storage observability metrics:
There are only two hard things in computer science: cache invalidation and naming things.
For instance, you might check the WriteIOPS
chart in AWS and assume you are achieving a hundred thousand disk I/O operations per second. This is wrong and just due to poor naming conventions that do not accurately reflect what is being measured. I got this from a small db.t3.medium instance that was supposed to provide a 2000 IOPS baseline with 11800 Max IOPS:
We are not counting the same I/O operations there.
Fortunately, if you click on the info box (ℹ️), you can get its description (also visible in Amazon CloudWatch metrics for Amazon Aurora).
Does it make it clear what it measures? Not at all because it uses terms specific to an infrastructure that is not publicly explained. What is a "log record" in PostgreSQL-compatible Aurora? Is it the Write-Ahead Logging (WAL) that we know in PostgreSQL? Is it a filesystem log record like in ZFS?
The only thing the description makes clear is that WriteIOPS is probably not what you think it is ;)
We will see that a similar metric is available in Performance Insight as os.diskIO.auroraStorage.writeIOsPS
, documented in Performance Insights counter metrics:
It's "The number of write operations per second," and the type "Disk IO" suggests that it is related to the physical writes to a disk. But it is not. Never trust the documentation. Test it.
The WriteIOPS chart above shows two tests I've run. Here are the details.
A single table without Indexes
For the first run, I created a simple table without any index.
create table demo ( id bigserial, value int );
I ran transactions that bulk insert a thousand rows, continuously, from fifty sessions using PgBench:
postgres=> \! pgbench -c 50 -T 900 -P 90 -nf /dev/stdin <<<'insert into demo(value) select generate_series(1,1000)'
pgbench (16.2, server 16.4)
progress: 90.0 s, 83.6 tps, lat 511.439 ms stddev 215.206, 0 failed
progress: 180.0 s, 100.4 tps, lat 500.255 ms stddev 218.979, 0 failed
progress: 270.0 s, 100.4 tps, lat 495.589 ms stddev 240.228, 0 failed
progress: 360.0 s, 115.4 tps, lat 435.752 ms stddev 219.620, 0 failed
progress: 450.0 s, 128.7 tps, lat 388.603 ms stddev 157.328, 0 failed
progress: 540.0 s, 134.0 tps, lat 372.766 ms stddev 149.400, 0 failed
progress: 630.0 s, 132.4 tps, lat 377.225 ms stddev 161.455, 0 failed
progress: 720.0 s, 134.0 tps, lat 374.112 ms stddev 155.527, 0 failed
progress: 810.0 s, 130.8 tps, lat 380.153 ms stddev 155.590, 0 failed
progress: 900.0 s, 114.3 tps, lat 439.524 ms stddev 218.516, 0 failed
transaction type: /dev/stdin
scaling factor: 1
query mode: simple
number of clients: 50
number of threads: 1
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 105700
number of failed transactions: 0 (0.000%)
latency average = 419.932 ms
latency stddev = 195.071 ms
initial connection time = 12377.544 ms
tps = 119.016746 (without initial connection time)
That ran around 120 transactions per second, inserting 120k rows per second.
The pressure from 50 connections to a small single instance shows a bottleneck in the shared buffers:
Performance Insights provides more metrics, like the IO operations per second, clearly labeled as Disk IO:
This matches precisely the WriteIOPS from CloudWatch. Those write IO operations are simply the number of inserted rows.
Same table with two indexes
To understand better, I've run a second example with the same table but two additional indexes: a primary key and a secondary index:
drop table demo;
create table demo ( id bigserial primary key, value int );
create index on demo(value);
As expected, the throughput is lower because there's more work to do to maintain the indexes:
postgres=> \! pgbench -c 50 -T 900 -P 90 -nf /dev/stdin <<<'insert into demo(value) select generate_series(1,1000)'
pgbench (16.2, server 16.4)
progress: 90.0 s, 24.9 tps, lat 1684.029 ms stddev 673.718, 0 failed
progress: 180.0 s, 29.5 tps, lat 1714.561 ms stddev 894.218, 0 failed
progress: 270.0 s, 36.7 tps, lat 1360.118 ms stddev 557.919, 0 failed
progress: 360.0 s, 34.6 tps, lat 1450.471 ms stddev 657.549, 0 failed
progress: 450.0 s, 31.2 tps, lat 1594.461 ms stddev 603.998, 0 failed
progress: 540.0 s, 31.8 tps, lat 1577.377 ms stddev 611.490, 0 failed
progress: 630.0 s, 30.7 tps, lat 1609.641 ms stddev 643.044, 0 failed
progress: 720.0 s, 33.5 tps, lat 1514.447 ms stddev 697.077, 0 failed
progress: 810.0 s, 29.4 tps, lat 1688.511 ms stddev 727.767, 0 failed
progress: 900.0 s, 31.8 tps, lat 1581.755 ms stddev 663.999, 0 failed
transaction type: /dev/stdin
scaling factor: 1
query mode: simple
number of clients: 50
number of threads: 1
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 28317
number of failed transactions: 0 (0.000%)
latency average = 1569.627 ms
latency stddev = 683.913 ms
initial connection time = 12290.474 ms
tps = 31.821387 (without initial connection time)
postgres=>
I had 32 transactions per second, which means 32k rows were inserted per second. Adding the two index entries per row has written a hundred thousand rows and index entries per second.
This puts more pressure on the shared buffer because of the nature of B Tree indexes:
The IO operations match the number of rows and index entries written, a hundred thousand:
I said "rows" and "index entries" because "tuple" can be misleading, but we have statistics about the table rows, called "tuples":
Conclusion
I conducted those two tests to clarify what "WriteIOPS" measures with RDS Aurora. This metric is important because you might assume that your database requires high-throughput storage and could overestimate the infrastructure needs when migrating to another database. I'm writing this in response to a question about sizing a YugabyteDB cluster for 800,000 IOPS. YugabyteDB can provide 800k IOPS, given its horizontal scalability, but that was not necessary here.
If you do not use "Optimized IO", another metric called "VolumeWriteIOPs" is important, as it can increase your bill. However, it has a completely different definition. Be aware that the lowercase "s" signifies the plural form and does not refer to "seconds":
In the first run, the Volume Write IOPs increased to 246k IO operations over 5 minutes, which is approximately 820 IOs per second, aligning more closely with the storage IOPS of my workload.
The IO Operations from Performance Insight are more detailed than the WriteIOPS metric which sums the persistent writes to the distributed storage and the temporary file writes on the local instance. The Performance Insight metrics are also available in CloudWatch:
If you jumped straight to the end of the blog post, it's important to clarify that the graphic illustrating 100,000 disk IO operations per second does not represent actual disk IO operations. Instead, it more accurately reflects the logical writes to the storage log, distributed across the disks, and applied to materialize the PostgreSQL blocks.