Here is a quick test I did after reading:
https://twitter.com/robtreat2/status/1434304518716600322?s=20
This thread was mentioning performance of Aurora (with PostgreSQL compatibility) with a Graviton2 instance (comparted to Intel ones)
I'll use YBIO to generate load on the following Amazon RDS Aurora instances:
db.r5.xlarge | db.r6g.xlarge | |
---|---|---|
Processor | Intel Xeon Platinum 8175 | AWS Graviton2 |
Architecture | x86_64 | aarch64 (Arm Neoverse N2) |
Aurora version | PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit | PostgreSQL 11.9 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 7.4.0, 64-bit |
vCPU | 4 | 4 |
RAM GiB | 32 | 32 |
Region | eu-west-1 (Ireland) | eu-west-1 (Ireland) |
Aurora On Demand | $0.64/hour | $0.575/hour |
The Graviton2 instance is 20% cheaper and I'm checking that performance is still fine.
I'm installing YBIO on a VM that has access to the databases. No need for large size for this because YBIO runs in the database only, maximizing the database work. Cloud Shell is sufficient to call it:
cd ~ && git clone https://github.com/FranckPachot/ybio.git && cd ybio
db.r5.xlarge
I install YBIO procedures
export PGUSER=postgres PGPASSWORD=postgres PGPORT=5432 PGHOST=r5-instance-1.cvlvfe1jv6n5.eu-west-1.rds.amazonaws.com
psql < ~/ybio/ybio.sql
bulk insert
I load one table with 100 million rows, with 1000 batches of 100 thousand rows
[opc@C ~]$ psql <<<"call setup(tab_rows=>100000000,batch_size=>100000)"
NOTICE: Inserting 100000000 rows in 1000 batches of 100000
NOTICE: Table bench0001 Progress: .10 % ( 100000 rows) at 103756 rows/s
NOTICE: Table bench0001 Progress: .20 % ( 200000 rows) at 104920 rows/s
NOTICE: Table bench0001 Progress: .30 % ( 300000 rows) at 102453 rows/s
NOTICE: Table bench0001 Progress: .40 % ( 400000 rows) at 97177 rows/s
...
NOTICE: Table bench0001 Progress: 50.00 % ( 50000000 rows) at 85702 rows/s
...
NOTICE: Table bench0001 Progress: 99.90 % ( 99900000 rows) at 83635 rows/s
NOTICE: Table bench0001 Progress: 100.00 % ( 100000000 rows) at 83648 rows/s
This has been loaded at 83648 rows per second
updates with large batch size
On this set of 100 million rows I'll do random updates committed every 100000 rows, running for 1 hour
[opc@C ~]$ psql <<<"call runit(tab_rows=>100000000,batch_size=>100000,pct_update=>100,run_duration=>'60 minutes')"
NOTICE: 71058 rows/s on bench0001, job: 1 batch#: 1, total: 100000 rows read, 100.0 % updated, last: 100000 rows between 494793 and 394794
NOTICE: 67282 rows/s on bench0001, job: 1 batch#: 2, total: 200000 rows read, 100.0 % updated, last: 100000 rows between 8995239 and 8895240
NOTICE: 65460 rows/s on bench0001, job: 1 batch#: 3, total: 300000 rows read, 100.0 % updated, last: 100000 rows between 85103442 and 85003443
...
NOTICE: 48849 rows/s on bench0001, job: 1 batch#: 1757, total: 175700000 rows read, 100.0 % updated, last: 100000 rows between 93309062 and 93209063
NOTICE: 48845 rows/s on bench0001, job: 1 batch#: 1758, total: 175800000 rows read, 100.0 % updated, last: 100000 rows between 85690247 and 85590248
CALL
The rate was 48845 rows per second on average
Then I've run the same with 4 sessions in parallel and looked at Performance Insight
I can see my 4 sessions with, on average, 3.66/4=91.5% of the time active on CPU, with 6% waiting on commit (I've written in the past about the IO:XactSync wait event, specific to Aurora writing WAL on remote storage)
updates with small batch size
This is the same with more frequent commits, with a batch size of 10 rows
Now, among the 4 sessions, only 1.1 is active on average on CPU. Frequent commits are limited by sending the WAL to the Aurora storage nodes, and waiting the acknowledge on commit. Here is the rate of updates with those short transactions:
...
NOTICE: 2579 rows/s on bench0001, job: 3 batch#: 928436, total: 9284232 rows read, 100.0 % updated, last: 10 rows between 64549839 and 645498
30 NOTICE: 2582 rows/s on bench0001, job: 5 batch#: 929309, total: 9293000 rows read, 100.0 % updated, last: 10 rows between 41030399 and 410303
90
NOTICE: 2582 rows/s on bench0001, job: 4 batch#: 929521, total: 9295110 rows read, 100.0 % updated, last: 10 rows between 31828647 and 318286
38
NOTICE: 2579 rows/s on bench0001, job: 3 batch#: 928437, total: 9284242 rows read, 100.0 % updated, last: 10 rows between 20386279 and 203862
70
...
[1] Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
[2] Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
[3]- Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
[4]+ Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
postgres=> select end_time-start_time duration,round(num_rows/extract(epoch from end_time-start_time)) riops
,round(100*max_scratch::float/table_scratch) as pct_scratch
, case when num_rows > table_rows then lpad(to_char(num_rows::float/table_rows,'xfmB9990D9'),6)
else lpad(to_char(100*num_rows/table_rows,'fmB999 %'),6) end coverage
,* from benchruns order by job_id desc nulls last limit 10;
duration | riops | pct_scratch | coverage | job_id | start_time | end_time | num_batches | num_rows | pct_update |max_scratch | prepared | index_only | tab_rows | batch_size | table_name | table_rows | table_scratch | comments
-----------------+-------+-------------+----------+--------+----------------------------+----------------------------+-------------+-----------+------------+-------------+----------+------------+-----------+------------+------------+------------+---------------+----------
01:00:00.00081 | 2583 | | | 5 | 2021-09-05 08:58:57.403667 | 2021-09-05 09:58:57.404477 | 929724 | 9297150 | 100 | 99999993 | t | f | 100000000 | 10 | bench0001 | | |
01:00:00.000344 | 2583 | | | 4 | 2021-09-05 08:58:56.947892 | 2021-09-05 09:58:56.948236 | 929747 | 9297370 | 100 | 99999976 | t | f | 100000000 | 10 | bench0001 | | |
01:00:00.001803 | 2579 | | | 3 | 2021-09-05 08:58:56.43201 | 2021-09-05 09:58:56.433813 | 928485 | 9284722 | 100 | 99999984 | t | f | 100000000 | 10 | bench0001 | | |
01:00:00.001239 | 2578 | | | 2 | 2021-09-05 08:58:55.684313 | 2021-09-05 09:58:55.685552 | 928084 | 9280760 | 100 | 99999994 | t | f | 100000000 | 10 | bench0001 | | |
01:00:01.509537 | 48841 | | | 1 | 2021-09-05 07:41:11.17698 | 2021-09-05 08:41:12.686517 | 1759 | 175900000 | 100 | 100000001 | t | f | 100000000 | 100000 | bench0001 | | |
(5 rows)
Of course this is slow when compared with the previous run: 10.3k rows per second in total from those 4 session which are actually using only one vCPU because they spend most of their time in writing the WAL to the remote storage. But this looks more like what an OLTP application is doing.
db.r6g.xlarge
I'll now compare the same on Graviton 2
export PGUSER=postgres PGPASSWORD=postgres PGPORT=5432 PGHOST=r6g-instance-1.cvlvfe1jv6n5.eu-west-1.rds.amazonaws.com
psql < ~/ybio/ybio.sql
bulk insert
[opc@C ~]$ psql <<<"call setup(tab_rows=>100000000,batch_size=>100000)"
NOTICE: Inserting 100000000 rows in 1000 batches of 100000
NOTICE: Table bench0001 Progress: .10 % ( 100000 rows) at 111652 rows/s
NOTICE: Table bench0001 Progress: .20 % ( 200000 rows) at 96538 rows/s
NOTICE: Table bench0001 Progress: .30 % ( 300000 rows) at 94140 rows/s
NOTICE: Table bench0001 Progress: .40 % ( 400000 rows) at 91229 rows/s
...
NOTICE: Table bench0001 Progress: 50.00 % ( 50000000 rows) at 82537 rows/s
...
NOTICE: Table bench0001 Progress: 99.90 % ( 99900000 rows) at 80259 rows/s
NOTICE: Table bench0001 Progress: 100.00 % ( 100000000 rows) at 80257 rows/s
We are 4% slower here to insert rows in bulk load. This is still ok given the 20% cheaper instance.
updates with large batch size
[opc@C ~]$ psql <<<"call runit(tab_rows=>100000000,batch_size=>100000,pct_update=>100,run_duration=>'60 minutes')"
NOTICE: 77075 rows/s on bench0001, job: 1 batch#: 1, total: 100000 rows read, 100.0 % updated, last: 100000 rows between 69757161 and 69657162
NOTICE: 67614 rows/s on bench0001, job: 1 batch#: 2, total: 200000 rows read, 100.0 % updated, last: 100000 rows between 45661380 and 45561381
NOTICE: 67075 rows/s on bench0001, job: 1 batch#: 3, total: 300000 rows read, 100.0 % updated, last: 100000 rows between 25263608 and 25163609
...
NOTICE: 47977 rows/s on bench0001, job: 1 batch#: 1726, total: 172600000 rows read, 100.0 % updated, last: 100000 rows between 56107266 and 56007267
NOTICE: 47974 rows/s on bench0001, job: 1 batch#: 1727, total: 172700000 rows read, 100.0 % updated, last: 100000 rows between 60313674 and 60213675
CALL
We are 2% slower with those updates, so still fine
With only 88% of the time on CPU it seems that we spend, proportionally, a little more time on system calls here on Graviton 2 when the 4 vCPU are busy
updates with small batch size
...
NOTICE: 2782 rows/s on bench0001, job: 5 batch#: 480115, total: 4801110 rows read, 100.0 % updated, last: 10 rows between 75413409 and 75413400
NOTICE: 2785 rows/s on bench0001, job: 3 batch#: 480855, total: 4808520 rows read, 100.0 % updated, last: 10 rows between 67380956 and 67380947
NOTICE: 2784 rows/s on bench0001, job: 2 batch#: 481035, total: 4810320 rows read, 100.0 % updated, last: 10 rows between 46272241 and 46272232
...
CALL
[1] Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
[2] Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
[3]- Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
[4]+ Done /usr/local/pgsql/bin/psql <<< "call runit(tab_rows=>100000000,batch_size=>10,pct_update=>100,run_duration=>'60 minutes')"
postgres=> select end_time-start_time duration,round(num_rows/extract(epoch from end_time-start_time)) riops
,round(100*max_scratch::float/table_scratch) as pct_scratch
, case when num_rows > table_rows then lpad(to_char(num_rows::float/table_rows,'xfmB9990D9'),6)
else lpad(to_char(100*num_rows/table_rows,'fmB999 %'),6) end coverage
,* from benchruns order by job_id desc nulls last limit 10;
duration | riops | pct_scratch | coverage | job_id | start_time | end_time | num_batches | num_rows | pct_update |max_scratch | prepared | index_only | tab_rows | batch_size | table_name | table_rows | table_scratch | comments
-----------------+-------+-------------+----------+--------+----------------------------+----------------------------+-------------+-----------+------------+-------------+----------+------------+-----------+------------+------------+------------+---------------+----------
01:00:00.002578 | 2802 | | | 5 | 2021-09-05 08:58:53.356155 | 2021-09-05 09:58:53.358733 | 1008742 | 10087340 | 100 | 100000002 | t | f | 100000000 | 10 | bench0001 | | |
01:00:00.002035 | 2802 | | | 4 | 2021-09-05 08:58:52.796646 | 2021-09-05 09:58:52.798681 | 1008736 | 10087280 | 100 | 99999984 | t | f | 100000000 | 10 | bench0001 | | |
01:00:00.000649 | 2802 | | | 3 | 2021-09-05 08:58:52.129614 | 2021-09-05 09:58:52.130263 | 1008730 | 10087260 | 100 | 100000000 | t | f | 100000000 | 10 | bench0001 | | |
01:00:00.001052 | 2800 | | | 2 | 2021-09-05 08:58:50.989991 | 2021-09-05 09:58:50.991043 | 1008024 | 10080150 | 100 | 99999987 | t | f | 100000000 | 10 | bench0001 | | |
01:00:02.253429 | 47970 | | | 1 | 2021-09-05 07:41:08.98654 | 2021-09-05 08:41:11.239969 | 1728 | 172800000 | 100 | 100000002 | t | f | 100000000 | 100000 | bench0001 | | |
(5 rows)
This is 11.2k rows per second in total from those 4 session which are actually using only one vCPU because they spend most of their time in writing the WAL to the remote storage. On this OLTP-like workload, we finally get a significant higher throughput with the Graviton2
The winner is the cheaper
In conclusion, please don't take those numbers as-is. They are relevant only within the context here. And depending on what you run you can show one or the other as the faster. Your application, you users, your peaks of activity are all different. What it shows is that there's no big difference with the performance of those workloads, whether in single session bulk DML, or in concurrent transactions with frequent commits. The main difference is the price: on AWS Graviton2 r6g is 20% cheaper than Intel r5 instances, so it is probably the best one to choose.
You expect me to do the same comparison with YugabyteDB? Sure, this will come soon. YBIO is designed to do the same micro-benchmark on any PostgreSQL compatible database.