If you use pgbench
with the default values and multiple threads, you do not test the scalability of your database because all threads are updating a single row. Depending on the isolation level, they either wait or fail on concurrent access. With the YugabyteDB version of pgbench
, the threads do not fail because a retry logic is implemented. Still, the pgbench application design uses more resources to provide a lower throughput.
This post proposes minimal changes to the database schema without changing the application code to make it scalable. I'm running this with YugabyteDB and yb_bench
to use the retry logic, but the same can be done with PostgreSQL and pgbench
.
Start YugabyteDB
I'll test on a local database:
docker run -d --name yb --hostname yb -p7000:7000 -p5433:5433 \
yugabytedb/yugabyte:latest bin/yugabyted start \
--daemon=false
docker exec -it yb postgres/bin/ysqlsh -h yb
You are at the psql
prompt, ysqlsh
is just a rename of it in YugabyteDB, to make it easy to run both in one environment (and with the YugabyteDB default port 5433). In the same way, ysql_bench
is a rename of pgbench
with a few improvements, like the retry logic (see a previous post).
ysql_bench -i
Here is the initialization:
\! postgres/bin/ysql_bench -i -h yb
With all defaults, this creates the tables and loads them with a scale of 1 (one branch).
yugabyte=# \! postgres/bin/ysql_bench -i -h yb
dropping old tables...
NOTICE: table "ysql_bench_accounts" does not exist, skipping
NOTICE: table "ysql_bench_branches" does not exist, skipping
NOTICE: table "ysql_bench_history" does not exist, skipping
NOTICE: table "ysql_bench_tellers" does not exist, skipping
creating tables (with primary keys)...
generating data...
100000 of 100000 tuples (100%) done (elapsed 5.58 s, remaining 0.00 s)
done.
"simple update" benchmark
I'm running the simple benchmark (builtin: simple update
) for 10 seconds (-T 10
) with 10 threads (-c 10
) showing the per-statement statistics (-r
) and no vacuum (-n
) as I'm running on YugabyteDB.
\! postgres/bin/ysql_bench -nrT10 -c10 -N -h yb
Here is the result - about 1700 transactions per second
yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 -N
transaction type: <builtin: simple update>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 17613
number of errors: 1 (0.006%)
maximum number of tries: 1
latency average = 5.684 ms (including errors)
tps = 1759.193190 (including connections establishing)
tps = 1762.223472 (excluding connections establishing)
statement latencies in milliseconds and errors:
0.006 0 \set aid random(1, 100000 * :scale)
0.001 0 \set bid random(1, 1 * :scale)
0.001 0 \set tid random(1, 10 * :scale)
0.001 0 \set delta random(-5000, 5000)
0.214 0 BEGIN;
2.328 0 UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.845 1 SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
1.306 0 INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.881 0 END;
With -N
, the updates are only per-account, and each thread touches a different account as they are randomly picked. Each statement execution is in milliseconds, so adding two more should stay in several hundreds of transactions per second.
"TPC-B (sort of)" benchmark
Without -N
, there are additional updates to maintain the per-teller and per-branch balance:
\! postgres/bin/ysql_bench -nrT10 -c10 -h yb
The result is very different here, with less than 100 transactions per second:
yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 932
number of errors: 9729 (91.258%)
maximum number of tries: 1
latency average = 9.397 ms (including errors)
tps = 93.027536 (including connections establishing)
tps = 93.208873 (excluding connections establishing)
statement latencies in milliseconds and errors:
0.013 0 \set aid random(1, 100000 * :scale)
0.001 0 \set bid random(1, 1 * :scale)
0.001 0 \set tid random(1, 10 * :scale)
0.001 0 \set delta random(-5000, 5000)
0.211 0 BEGIN;
2.602 0 UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.976 0 SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
1.914 1805 UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.915 6839 UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.357 0 INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.299 1085 END;
The important message is in number of errors
. pgbench
would have stopped the thread on the first error, which is not an option with a higher isolation level than the read committed. ysql_bench
adds a retry logic, but with no retry by default, and then reporting the errors (Operation failed. Try again: ... Conflicts with higher priority transaction: ...
or Operation expired: Transaction ... expired or aborted by a conflict: 40001
or Operation failed. Try again: Value write after transaction start: { physical: ... } >= { physical: ... }: kConflict
depending where the conflict occurs).
Optimistic locking retry logic
Here is the same attempting at most ten retries (--max-tries=10
):
yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10 -h yb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 913
number of errors: 691 (43.080%)
number of retried: 1282 (79.925%)
number of retries: 8839
maximum number of tries: 10
latency average = 62.579 ms (including errors)
tps = 90.957832 (including connections establishing)
tps = 91.109947 (excluding connections establishing)
statement latencies in milliseconds, errors and retries:
0.012 0 0 \set aid random(1, 100000 * :scale)
0.001 0 0 \set bid random(1, 1 * :scale)
0.001 0 0 \set tid random(1, 10 * :scale)
0.001 0 0 \set delta random(-5000, 5000)
0.229 0 0 BEGIN;
2.623 0 0 UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
1.008 0 0 SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
1.953 138 1632 UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.993 491 6174 UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.394 0 0 INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.314 62 1033 END;
The throughput is not better. Retry is OK only when transaction conflicts are rare, so optimistic locking is preferred. But here, all transactions are updating the same row in ysql_bench_branches
, and with only ten rows in ysql_bench_tellers
, the probability of conflict is high. This is especially true when transactions make many roundtrips to the server and are not using prepared statements: the longer the transaction time, the higher the probability of conflicts.
Scalability is also an application design concern
In short, the database can scale up with more CPU and memory and, with YugabyteDB, even scale out. However, the application must be designed to scale, which is not the case for the default pgbench
. Using prepared statements and doing only one roundtrip per transaction is well known. In this post, I'll focus on table design.
More rows on hotspot tables
The idea is simple: to avoid contention on one row, let's have multiple rows.
I create tables with an additional h
column that is included in the primary key:
create table h_ysql_bench_tellers(
h int, tid int, bid int, tbalance int, filler text
,primary key(h,tid)
);
create table h_ysql_bench_branches (
h int, bid int, bbalance int, filler text
,primary key(h,bid)
);
I fill this table with the rows from the original tables, but 100 rows for each:
insert into h_ysql_bench_tellers
select generate_series(0,99), * from ysql_bench_tellers;
insert into h_ysql_bench_branches
select generate_series(0,99), * from ysql_bench_branches;
drop table ysql_bench_branches;
I simplified things here by suggesting starting with an initialized database with all balances at zero. If not, you should change the total so that the balance doesn't change, putting the balance in only one bucket and zero for the other 99.
Then, a view will replace the tables that are updated by the application. It will update one of the 100 rows, depending on the pg_backend_pid()
here, but you can use whatever distributes the load evenly:
create or replace view ysql_bench_branches as
select * from h_ysql_bench_branches
where h=pg_backend_pid()%100;
drop table ysql_bench_tellers;
create or replace view ysql_bench_tellers as
select * from h_ysql_bench_tellers
where h=pg_backend_pid()%100;
Running with no application change
Now running the same as above:
\! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10 -h yb
The throughput is now 10x higher:
yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 --max-tries=10
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 9629
number of retried: 1 (0.010%)
number of retries: 1
maximum number of tries: 10
latency average = 10.399 ms
tps = 961.624034 (including connections establishing)
tps = 963.739905 (excluding connections establishing)
statement latencies in milliseconds and retries:
0.009 0 \set aid random(1, 100000 * :scale)
0.001 0 \set bid random(1, 1 * :scale)
0.001 0 \set tid random(1, 10 * :scale)
0.001 0 \set delta random(-5000, 5000)
0.221 0 BEGIN;
2.563 0 UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.950 0 SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
2.030 0 UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
1.969 0 UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.330 0 INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
1.088 1 END;
The number of retries is minimal. It can increase with more connections, but the number of rows in ysql_bench_branches
and ysql_bench_tellers
can also be increased. Of course, it is better to run the benchmark longer. I'm in the same ballpark with 500 seconds:
yugabyte=# \! postgres/bin/ysql_bench -nrT500 -c10 -h yb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 500 s
number of transactions actually processed: 383753
number of errors: 14 (0.004%)
maximum number of tries: 1
latency average = 13.030 ms (including errors)
tps = 767.452309 (including connections establishing)
tps = 767.484079 (excluding connections establishing)
Execution plan
It is important to verify that the execution plan scales:
begin transaction;
explain (costs off, analyze) UPDATE ysql_bench_branches SET bbalance = bbalance + 0 WHERE bid = 1;
rollback;
An Index Scan going to exactly one row, and one YugabyteDB tablet (thanks to hash sharding), is scalable:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Update on h_ysql_bench_branches (actual time=0.823..0.823 rows=0 loops=1)
-> Index Scan using h_ysql_bench_branches_pkey on h_ysql_bench_branches (actual time=0.792..0.794 rows=1 loops=1)
Index Cond: ((h = (pg_backend_pid() % 100)) AND (bid = 1))
Planning Time: 5.833 ms
Execution Time: 14.356 ms
(5 rows)
If needed, we can increase the number of rows and the % 100
when taking the modulo.
Queries
The beauty of it is that I didn't change the application, thanks to the power of SQL, here using views.
But be careful if you have the application querying those views:
yugabyte=# select * from ysql_bench_branches;
h | bid | bbalance | filler
---+-----+----------+--------
1 | 1 | 56792 |
(1 row)
yugabyte=# \c
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# select * from ysql_bench_branches;
h | bid | bbalance | filler
----+-----+----------+--------
87 | 1 | 154853 |
(1 row)
yugabyte=# \c
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# select * from ysql_bench_branches;
h | bid | bbalance | filler
----+-----+----------+--------
99 | 1 | 56792 |
(1 row)
The select reads the view and then shows only the value for the current pg_backend_pid()
. It is essential to understand that this change is transparent to the application because you don't have to change the code, but you must know the queries run by other modules.
Here is how to query the balance from the tables:
yugabyte=#
select bid,sum(bbalance)
from h_ysql_bench_branches
group by bid;
bid | sum
-----+---------
1 | 5599083
(1 row)
yugabyte=#
select tid,bid,sum(tbalance)
from h_ysql_bench_tellers
group by grouping sets ( (tid,bid) , (bid) )
order by tid nulls last;
tid | bid | sum
-----+-----+----------
1 | 1 | 2384939
2 | 1 | 4526402
3 | 1 | 1442672
4 | 1 | -1166952
5 | 1 | 3525803
6 | 1 | -1542642
7 | 1 | -1783062
8 | 1 | 4453583
9 | 1 | -5990027
10 | 1 | -251633
| 1 | 5599083
(11 rows)
The beauty of SQL is that you can be 100% confident in the consistency. The total amount is the same, even with transaction errors and retries, because transactions are ACID.
If you need application transparency for queries, there are other solutions. It all depends on your knowledge of queries. For example, you can add the where h=pg_backend_pid()%100
logic with RLS policies, update the base table through an instead-off trigger, and leave the view showing the balance for all rows.
Update 2022-03-16
Adrian remarked on an important point: if you have check constraints, you can't easily enforce them on multi-rows. It is a balance between consistency (the C in ACID, not the C in CAP) and scalability:
Update 2024-06-10 with Read Committed isolation level
This was written when YugabyteDB was using Repeatable Read isolation level with fail-on-conflict. To avoid retriable errors (--max-tries
), you can run in Read Committed with wait-on-conflict:
docker run -d --name yb --hostname yb -p7000:7000 -p5433:5433 \
yugabytedb/yugabyte:latest bin/yugabyted start \
--tserver_flags=yb_enable_read_committed_isolation=true \
--background=false
You get better performance, and no need for a retry logic, even with the default PgBench design where all sessions update the same row:
yugabyte=# \! postgres/bin/ysql_bench -nrT10 -c10 -h yb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
batch size: 1024
duration: 10 s
number of transactions actually processed: 3398
maximum number of tries: 1
latency average = 29.539 ms
tps = 338.538140 (including connections establishing)
tps = 339.196855 (excluding connections establishing)
statement latencies in milliseconds:
0.012 \set aid random(1, 100000 * :scale)
0.002 \set bid random(1, 1 * :scale)
0.002 \set tid random(1, 10 * :scale)
0.002 \set delta random(-5000, 5000)
0.201 BEGIN;
1.596 UPDATE ysql_bench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.883 SELECT abalance FROM ysql_bench_accounts WHERE aid = :aid;
9.856 UPDATE ysql_bench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
14.270 UPDATE ysql_bench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.158 INSERT INTO ysql_bench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.928 END;
YugabyteDB is PostgreSQL-compatible and open source. Read Committed is not enabled by default for backward compatibility, but it is the recommended setting, as a PostgreSQL-compatible database must provide all isolation levels to have the same runtime behavior as PostgreSQL.