In his book "Designing Data-Intensive Applications", Martin Kleppmann explains the concept of write-skew using a simple example: the on-call doctors for a shift in a hospital. Multiple doctors can be on call for a shift, requiring at least one doctor. If a doctor wants to give up his shift, he must ensure that at least two doctors are on call so that there is always one remaining.
Here is the example:
yugabyte=# create table doctors (
primary key (shift_id, name)
,shift_id int, name text, on_call boolean
);
CREATE TABLE
yugabyte=# insert into doctors values (1, 'Alice', true), (1, 'Bob', true);
INSERT 0 2
yugabyte=# -- Bob
yugabyte=# begin;
BEGIN
yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;
count
-------
2
(1 row)
yugabyte=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';
UPDATE 1
yugabyte=*# commit;
COMMIT
yugabyte=# select * from doctors where shift_id = 1;
shift_id | name | on_call
----------+-------+---------
1 | Alice | t
1 | Bob | f
(2 rows)
Bob confirmed that there were enough doctors. He gave up his shift, and Alice remained on call. This is correct.
Concurrent transactions in Read Committed
Alice may have the same intention at the same time. I simulate that by interleaving the transactions and calling another psql
from the main interactive psql
.
yugabyte=# -- Bob
yugabyte=# begin;
BEGIN
yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;
count
-------
2
(1 row)
yugabyte=*# -- Alice
yugabyte=*# \! psql -c "begin" -c "select count(*) from doctors where shift_id = 1 and on_call" -c "update doctors set on_call = false
where shift_id = 1 and name = 'Alice'" -c "commit"
BEGIN
count
-------
2
(1 row)
UPDATE 1
COMMIT
yugabyte=*# -- Bob
yugabyte=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';
UPDATE 1
yugabyte=*# commit;
COMMIT
yugabyte=# select * from doctors where shift_id = 1;
shift_id | name | on_call
----------+-------+---------
1 | Alice | f
1 | Bob | f
(2 rows)
At the end, there are no doctors on call because the SQL transactions are isolated. This means they cannot see each other's actions. Both transactions made decisions based on a state that the other was changing. From a user's point of view, it is an anomaly because each doctor checked to see if another doctor was on call.
The database expects this behavior because the default isolation level is Read Committed. Each statement can read from a different database state, allowing for a write skew in the transaction. One way to address this issue is by using the serializable isolation level. In this scenario, the read-write conflict will be detected.
Concurrent transactions in Serializable
In the Serializable isolation level, PostgreSQL will detect conflicts and prevent the second update from being performed:
postgres=# -- Bob
postgres=# begin isolation level serializable;
BEGIN
postgres=*# select count(*) from doctors where shift_id = 1 and on_call;
count
-------
2
(1 row)
postgres=*# -- Alice
postgres=*# \! psql -c "begin isolation level serializable" -c "select count(*) from doctors where shift_id = 1 and on_call" -c "update doctors set on_call = false where shift_id = 1 and name = 'Alice'" -c "commit"
BEGIN
count
-------
2
(1 row)
UPDATE 1
COMMIT
postgres=*# -- Bob
postgres=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';
ERROR: 40001: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during write.
HINT: The transaction might succeed if retried.
LOCATION: OnConflict_CheckForSerializationFailure, predicate.c:4598
postgres=!# commit;
ROLLBACK
postgres=# select * from doctors where shift_id = 1;
shift_id | name | on_call
----------+-------+---------
1 | Bob | t
1 | Alice | f
(2 rows)
In this case, Bob's transaction was canceled. He can retry and see only one doctor on call. Then, he can decide not to give up his on-call. PostgreSQL allowed the first update and later detected the conflict.
YugabyteDB behaves similarly but uses pessimistic locking by default (Wait-on-Conflict instead of Fail-on-Conflict). With pessimistic locking, it waits (until timeout) on the other transaction while I run the second transaction in the background.
yugabyte=# -- Bob
yugabyte=# begin isolation level serializable;
BEGIN
yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;
count
-------
2
(1 row)
yugabyte=*# -- Alice
yugabyte=*# \! psql -c "begin isolation level serializable" -c "select count(*) from doctors where shift_id = 1 and on_call" -c "update doctors set on_call = false where shift_id = 1 and name = 'Alice'" -c "commit" &
BEGIN
count
-------
2
(1 row)
yugabyte=*# -- Bob
yugabyte=*# update doctors set on_call = false where shift_id = 1 and name = 'Bob';
yugabyte=*# -- Alice session gets:
ERROR: 40P01: deadlock detected (query layer retry is not possible because data was already sent, if this is the read committed isolation (or) the first statement in repeatable read/ serializable isolation transaction, consider increasing the tserver gflag ysql_output_buffer_size)
DETAIL: Transaction aa479225-9aeb-4560-a320-6d704ec13b37 aborted due to a deadlock: <1729284946537561>0c4a7f52-4830-459e-98fc-a57016b3ff34-><1729284949382417>aa479225-9aeb-4560-a320-6d704ec13b37->: kDeadlock
yugabyte=*# ROLLBACK
yugabyte=*# -- Bob session gets:
yugabyte=*# commit;
COMMIT
yugabyte=# select * from doctors where shift_id = 1;
shift_id | name | on_call
----------+-------+---------
1 | Bob | f
1 | Alice | t
(2 rows)
Alice's update was waiting for Bob's completion because Bob had declared his serializable intent to read the data that Alice was going to update. Similarly, Bob was in the same situation, intending to update the record that Alice had read. This resulted in a deadlock, and Alice's transaction was canceled.
Internally, YugabyteDB used a range lock to avoid locking the whole table. If you try the same with multiple sessions updating the on-call doctor for different shifts, they will not block each other.
YugabyteDB bases this optimization on the SQL schema by locking a subset of a primary key. Here, the primary key is (shift_id, name)
, and reading where shift_id = 1
locks only the value "1" of the first column.
This is visible by querying pg_locks
after Bob's read:
yugabyte=# create table doctors (shift_id int, name text, on_call boolean, primary key(shift_id,name)) ;
CREATE TABLE
yugabyte=# insert into doctors values (1, 'Alice', true), (1, 'Bob', true);
INSERT 0 2
yugabyte=# -- Bob
yugabyte=# begin isolation level serializable;
BEGIN
yugabyte=*# select count(*) from doctors where shift_id = 1 and on_call;
count
-------
2
(1 row)
yugabyte=*# select locktype, mode, granted, ybdetails->'keyrangedetails' from pg_locks;
locktype | mode | granted | ?column?
----------+-------------+---------+----------------------------------------------------------------------------------
relation | WEAK_READ | t | {"cols": null, "attnum": null, "column_id": null, "multiple_rows_locked": true}
keyrange | STRONG_READ | t | {"cols": ["1"], "attnum": null, "column_id": null, "multiple_rows_locked": true}
(2 rows)
The lock at the table level (relation
) is weak, and weak locks do not conflict. The strong read lock, which conflicts with strong write locks, is on a range (keyrange
) determined by the value "1" in first column of the primary key ("cols": ["1"]
). To avoid locking complex predicates or ranges, without locking the whole table, YugabyteDB relies on the relational model.
YugabyteDB uses range locks based on the values of a composite primary key as a trade-off between locking the entire table and recording all predicates. This implementation provides insights into how to prevent write skew in databases that do not support the ANSI/ISO isolation level.
Relational Model and Read Committed
Let's create a schema that can prevent write skew, without range locks, in Read Committed isolation level. The on-call doctors scenario deals with two business entities: doctors and shifts.
They were visible in the composite primary key, but we can create one table for each, so that we can use row locking:
create table shifts (shift_id int primary key);
create table shift_doctors (
shift_id int references shifts, name text, on_call boolean
);
insert into shifts values (1);
insert into shift_doctors
values (1, 'Alice', true), (1, 'Bob', true);
With such a schema, the application can use explicit locking to serialize the transactions that work on the same shift with a select for update
:
select from shifts where shift_id = 1 for update;
select count(*) from shift_doctors where shift_id = 1 and on_call;
Here is the same example on this new schema, in Read Committed isolation level with explicit locking:
yugabyte=# -- Bob
yugabyte=# begin isolation level read committed;
BEGIN
yugabyte=*# select from shifts where shift_id = 1 for update;
--
(1 row)
yugabyte=*# select count(*) from shift_doctors where shift_id = 1 and on_call;
count
-------
2
(1 row)
yugabyte=*# -- Alice
yugabyte=*# \! psql -c "begin isolation level read committed" -c "select from shifts where shift_id = 1 for update" -c "select count(*) from shift_doctors where shift_id = 1 and on_call" &
BEGIN
yugabyte=*# -- Alice
yugabyte=*# \! psql -c "begin isolation level read committed" -c "select from shifts where shift_id = 1 for update" -c "select count(*) from shift_doctors where shift_id = 1 and on_call" &
yugabyte=*# BEGIN
yugabyte=*# -- Bob
yugabyte=*# update shift_doctors set on_call = false where shift_id = 1 and name = 'Bob';
UPDATE 1
yugabyte=*# commit;
COMMIT
yugabyte=# -- Alice
(1 row)
count
-------
1
(1 row)
In this situation, Alice waited for the SELECT FOR UPDATE, which involves the same row for the two sessions (one row per shift in the "shifts" table). After Bob's session was committed, Alice's transaction continued. Since it is Read Committed, it could see Bob's committed change. Then, upon realizing that there were no other doctors on call, she decided to stay on call.
The explicit lock acquired by using SELECT FOR UPDATE on the "shifts" table is similar to the implicit lock acquired on the "shift_id" column in the single-table case and serializable isolation level:
yugabyte=*# select locktype, mode, granted, ybdetails->'keyrangedetails' from pg_locks;
locktype | mode | granted | ?column?
----------+--------------------------+---------+-----------------------------------------------------------------------------------
relation | WEAK_READ,WEAK_WRITE | t | {"cols": null, "attnum": null, "column_id": null, "multiple_rows_locked": true}
row | STRONG_READ,STRONG_WRITE | t | {"cols": ["1"], "attnum": null, "column_id": null, "multiple_rows_locked": false}
(2 rows)
In contrast to many other databases, YugabyteDB can display row locks in the pg_locks
table. The serializable isolation level shows range locks, with the intent value from implicit locking predicates. The read committed isolation level displays the actual value that was read with explicit locking. For security reasons, you need to have superuser privileges to view these values.
Using SELECT FOR UPDATE allows you to achieve the same business logic as serializable transactions. Due to data consistency, relying on application design requires more code and tests but does not need to implement retry logic. In some cases, you have no choice. Oracle Database does not support serializable transactions to avoid write skew and does not implement SELECT FOR SHARE. Applications can still maintain consistency by using SELECT FOR UPDATE in this scenario, and critical applications have been running like this for decades. On the other hand, a serializable isolation level may be preferred in PostgreSQL because using the FOR UPDATE/SHARE clause in Read Committed is not entirely isolated from concurrent transaction commits.
YugabyteDB implements all SQL isolation levels according to the ANSI/ISO definition and is compatible with PostgreSQL runtime behavior. It also supports implicit locking in share and exclusive mode, SELECT FOR SHARE and SELECT FOR UPDATE, both with full transaction isolation. Instead of using FOR SHARE, I chose FOR UPDATE to avoid a deadlock situation similar to the one we experienced at the serializable isolation level. The choice between shared or exclusive row lock represents a choice of pessimistic or optimistic locking to fit the most probable intention.