In today's Open Hours, we had a discussion about YugabyteDB's Wait-on-Conflict behavior and demonstrated it with Read Committed. Later that day, a customer asked about Error-On-Conflict raising 40001. They ran a query in Repeatable Read mode without Wait Queues and encountered an error when one session updated all rows in a large table while another updated just one row. I reproduced the problem, demonstrated the behavior with different settings, including Read Committed and Wait Queues, and found some interesting things to share.
I need a table that doesn't have to be too large but should have a slow update. I use a slow()
function when reading only 10 rows. My all-rows update will take 10 seconds. We will see later that it may not be the best idea, but it is more straightforward than attaching a debugger and setting some. I run a concurrent update on the 9th row, which happens before the full update reads it, and I run another concurrent update on the 1st row, occurring after the full update reads and updates it. To ensure the rows are read in a specific order, I added a primary key in ascending order.
The version I used for the Open Hours demo was 2.20.1.0-b97
.
I start with the following, which are the defaults:
yb_enable_read_committed_isolation=false,enable_wait_queues=false
To achieve the same result, you can simply copy and paste the following Docker command in a lab:
docker run -i --rm yugabytedb/yugabyte:2.20.1.0-b97 bash -x <<'BASH'
## start a one-node cluster
yugabyted start --tserver_flags='yb_enable_read_committed_isolation=false,enable_wait_queues=false'
## Trace the restarts and show the PostgreSQL logfile in the background
export PGOPTIONS='-c yb_debug_log_internal_restarts=on'
tail -F /root/var/logs/tserver/postgres*log &
## create the demo table and the slow function
yugabyted connect ysql << 'SQL'
create table demo ( k int, v int, primary key(k asc));
insert into demo select generate_series(1,10), 0 ;
create or replace function slow(n int) returns int as $$
-- waits one second and return the same value as the input
begin perform pg_sleep(1); return n; end;
$$ language plpgsql;
SQL
## full-table update waiting
yugabyted connect ysql <<<'update demo set v=v+10 where slow(k)=k returning k, v, clock_timestamp();' &
sleep 3
## once I'm sure the first row has been updated, but before the 9th row is updated (after 3 seconds we should be between them)
## update row 9 and row 1 and display the clock time
yugabyted connect ysql <<<'update demo set v=v+100 where k=9 returning k, v, clock_timestamp();'
yugabyted connect ysql <<<'update demo set v=v+1000 where k=1 returning k, v, clock_timestamp();'
## wait for the long update to commit
sleep 15
BASH
Repeatable Read with Fail-on-Conflict
With yb_enable_read_committed_isolation=false
and enable_wait_queues=false
the conflict on the 1st row kills one session for the other to continue (which one depends on random priorities):
+ yugabyted connect ysql
k | v | clock_timestamp
---+-----+------------------------------
9 | 100 | 2024-01-10 18:50:39.54409+00
(1 row)
UPDATE 1
+ yugabyted connect ysql
k | v | clock_timestamp
---+------+-------------------------------
1 | 1000 | 2024-01-10 18:50:39.739687+00
(1 row)
UPDATE 1
+ sleep 15
ERROR: Heartbeat: Transaction f8d060f7-1ed0-4dd8-8ebd-903f1aea5d48 expired or aborted by a conflict: 40001: . Errors from tablet servers: [Operation expired (yb/tablet/transaction_coordinator.cc:1722): Heartbeat: Transaction f8d060f7-1ed0-4dd8-8ebd-903f1aea5d48 expired or aborted by a conflict: 40001 (pgsql error 40001) (transaction error 1)]
In this case, the application has to handle the retry logic. Additionally, the rollback and restart of a full update on a large table is not optimal.
Read Committed with Fail-on-Conflict
With yb_enable_read_committed_isolation=true
and enable_wait_queues=false
, Read Committed restarts at a newer read time when encountering a conflict on the 1st row, until the other transaction commits:
+ yugabyted connect ysql
k | v | clock_timestamp
---+-----+-------------------------------
9 | 100 | 2024-01-10 18:54:35.991011+00
(1 row)
UPDATE 1
2024-01-10 18:54:36.184 UTC [311] LOG: Restart is possible
2024-01-10 18:54:36.184 UTC [311] STATEMENT: update demo set v=v+1000 where k=1 returning k, v, clock_timestamp();
2024-01-10 18:54:36.184 UTC [311] LOG: Restarting statement due to kReadRestart/kConflict error:
Query: update demo set v=v+1000 where k=1 returning k, v, clock_timestamp();
Error: could not serialize access due to concurrent update
Attempt No: 0
2024-01-10 18:54:36.184 UTC [311] STATEMENT: update demo set v=v+1000 where k=1 returning k, v, clock_timestamp();
...
2024-01-10 18:54:39.694 UTC [311] LOG: Restart is possible
2024-01-10 18:54:39.694 UTC [311] STATEMENT: update demo set v=v+1000 where k=1 returning k, v, clock_timestamp();
2024-01-10 18:54:39.694 UTC [311] LOG: Restarting statement due to kReadRestart/kConflict error:
Query: update demo set v=v+1000 where k=1 returning k, v, clock_timestamp();
Error: could not serialize access due to concurrent update
Attempt No: 6
2024-01-10 18:54:39.694 UTC [311] STATEMENT: update demo set v=v+1000 where k=1 returning k, v, clock_timestamp();
k | v | clock_timestamp
----+----+-------------------------------
1 | 10 | 2024-01-10 18:54:34.031129+00
2 | 10 | 2024-01-10 18:54:35.033599+00
3 | 10 | 2024-01-10 18:54:36.035551+00
4 | 10 | 2024-01-10 18:54:37.036974+00
5 | 10 | 2024-01-10 18:54:38.038913+00
6 | 10 | 2024-01-10 18:54:39.040828+00
7 | 10 | 2024-01-10 18:54:40.042877+00
8 | 10 | 2024-01-10 18:54:41.044863+00
9 | 10 | 2024-01-10 18:54:42.046403+00
10 | 10 | 2024-01-10 18:54:43.048304+00
(10 rows)
UPDATE 10
k | v | clock_timestamp
---+------+-------------------------------
1 | 1010 | 2024-01-10 18:54:43.703361+00
(1 row)
UPDATE 1
+ sleep 15
You may wonder why the full update didn't restart when it encountered the 9th row, which was concurrently updated after the read time. Me too. I would expect a restart that sees the new value and updates the 9th row to value 110. Check issue #20549 and remember that Read Committed is still in preview in this release, not GA (General Availability). This may be a side effect of my slow()
function.
Read Committed with Wait-on-Conflict
With yb_enable_read_committed_isolation=true
and enable_wait_queues=true
, the update on the 1st row, that detects the conflict, waits for the full update to commit:
+ yugabyted connect ysql
k | v | clock_timestamp
---+-----+-------------------------------
9 | 100 | 2024-01-10 18:53:37.943537+00
(1 row)
UPDATE 1
+ yugabyted connect ysql
k | v | clock_timestamp
----+----+-------------------------------
1 | 10 | 2024-01-10 18:53:36.004246+00
2 | 10 | 2024-01-10 18:53:37.006284+00
3 | 10 | 2024-01-10 18:53:38.008266+00
4 | 10 | 2024-01-10 18:53:39.010164+00
5 | 10 | 2024-01-10 18:53:40.01211+00
6 | 10 | 2024-01-10 18:53:41.013585+00
7 | 10 | 2024-01-10 18:53:42.015461+00
8 | 10 | 2024-01-10 18:53:43.017792+00
9 | 10 | 2024-01-10 18:53:44.019945+00
10 | 10 | 2024-01-10 18:53:45.021895+00
(10 rows)
UPDATE 10
k | v | clock_timestamp
---+------+-------------------------------
1 | 1010 | 2024-01-10 18:53:45.024199+00
(1 row)
UPDATE 1
+ sleep 15
The behavior, from a user point of view, is the same as with transparent retries, and shows the same anomaly where the full update didn't restart. The clock_timestamp()
shows when the updates occurred, and the update on the 9th row was ignored by the full update, which started before but finished after.
Repeatable Read with Wait-on-Conflict
With yb_enable_read_committed_isolation=false
and enable_wait_queues=true
, the update on the 1st row waits, and the full update detecting the conflict on the 9th row can restart because it is the first statement of the transaction:
+ yugabyted connect ysql
k | v | clock_timestamp
---+-----+------------------------------
9 | 100 | 2024-01-10 20:23:51.64415+00
(1 row)
UPDATE 1
k | v | clock_timestamp
---+------+-------------------------------
1 | 1000 | 2024-01-10 20:23:57.703115+00
(1 row)
UPDATE 1
+ sleep 15
2024-01-10 20:23:57.702 UTC [278] LOG: Restart is possible
2024-01-10 20:23:57.702 UTC [278] STATEMENT: update demo set v=v+10 where slow(k)=k returning k, v, clock_timestamp();
2024-01-10 20:23:57.702 UTC [278] LOG: Restarting statement due to kReadRestart/kConflict error:
Query: update demo set v=v+10 where slow(k)=k returning k, v, clock_timestamp();
Error: could not serialize access due to concurrent update
Attempt No: 0
2024-01-10 20:23:57.702 UTC [278] STATEMENT: update demo set v=v+10 where slow(k)=k returning k, v, clock_timestamp();
k | v | clock_timestamp
----+------+-------------------------------
1 | 1010 | 2024-01-10 20:23:58.704736+00
2 | 10 | 2024-01-10 20:23:59.707063+00
3 | 10 | 2024-01-10 20:24:00.709297+00
4 | 10 | 2024-01-10 20:24:01.711381+00
5 | 10 | 2024-01-10 20:24:02.713267+00
6 | 10 | 2024-01-10 20:24:03.715094+00
7 | 10 | 2024-01-10 20:24:04.716999+00
8 | 10 | 2024-01-10 20:24:05.718952+00
9 | 110 | 2024-01-10 20:24:06.720901+00
10 | 10 | 2024-01-10 20:24:07.72285+00
(10 rows)
UPDATE 10
This is the correct behavior. From a performance point of view, it would have been better to get the more minor updates to restart, but the update on the 9th row is already committed when the full update encounters it. The full update takes 19 seconds because it has been restarted after 9 seconds. You can achieve this with a Serializable isolation level that acquires a read lock on each row so the others will wait to execute later.
If you add set default_transaction_isolation=serializable;
in front of each update you will see that the entire update is the first one that completes and takes 10 seconds. The others wait for it.
When using the Repeatable Read isolation level, the read time is set at the beginning of the transaction. It might be surprising to encounter a restart while using this isolation level, but it's not directly dependent on the level itself. A statement can be restarted if it can advance the read time. This is only possible in Read Committed when the statement has not yet returned data to the application. Additionally, it is only possible in a Repeatable Read when this statement is the first one in the transaction.
Repeatable Read with Wait-on-Conflict is a good isolation level if you are using single-statement transactions, such as with auto-commit, that avoids anomalies without additional locking, and handles single-transaction retries automatically and transparently without requiring the application to add more code. If you know that you will be acquiring a Write lock for each row, then the Serializable read locks will not add much overhead and may even improve performance in some cases.