In the YugabyteDB documentation about transactions, there's a description of two transaction paths:
- Single-row transactions, also called 'fast-path', where rows can directly be written to the final LSM-Tree (an enhanced RocksDB) for the tablet (which is the name for a table/index shard)
- Distributed transactions where the SQL transaction does more and some information to persist are known only at COMMIT (the commit status and time). YugabyteDB writes all provisional records to the IntentsDB, another LSM-Tree for the tablet, and stores the transaction status in a transaction table so that other transactions can know which changes are visible to others.
In this blog post, I'll show how to see exactly what is written, where, and when.
Tracing write operations to IntentsDB and RegularDB in a lab
I start a YugabyteDB container with tracing all writes as in the previous post and enter the YSQL shell:
docker exec -it $(
docker run -d yugabytedb/yugabyte:latest sleep infinity
) bash
yugabyted start --advertise_address=0.0.0.0 \
--tserver_flags="TEST_docdb_log_write_batches=true,tserver_enable_metrics_snapshotter=false"
until postgres/bin/pg_isready ; do sleep 1 ; done | uniq
tail -f /root/var/logs/tserver/yb-tserver.INFO | grep -E '^ | tablet.cc:' &
ysqlsh
All writes to the YugabyteDB tablets will be logged to /root/var/logs/tserver/yb-tserver.INFO
and I disabled the metrics snapshotter because it also writes to tablets and I want to see only my own writes. I tail the log to my console to see the writes while I'm testing some DML.
The logs will show one line per batch of writes from tablet.cc
with the number of key/value pairs
and which LSM-Tree (kIntents RocksDB
or kRegular RocksDB
). Then follows the detail with one line per key/value, numbered, and mentioning which LSM-Tree ([I]
ntents or [R]
egular).
I create a demo table. The default logs show the creation of the tablet:
yugabyte=# create table demo(id bigint primary key, value text);
CREATE TABLE
I0611 19:41:25.209566 1512 tablet.cc:469] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Schema version for demo is 0
I0611 19:41:25.209762 1512 tablet.cc:600] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Creating RocksDB database in dir /root/var/data/yb-data/tserver/data/rocksdb/table-000033f1000030008000000000004003/tablet-3ecc7bc3dcc049d0b7e9976c7971adde
I0611 19:41:25.236333 1512 tablet.cc:797] Opening RocksDB at: /root/var/data/yb-data/tserver/data/rocksdb/table-000033f1000030008000000000004003/tablet-3ecc7bc3dcc049d0b7e9976c7971adde
I0611 19:41:25.286031 1512 tablet.cc:812] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Opening intents DB at: /root/var/data/yb-data/tserver/data/rocksdb/table-000033f1000030008000000000004003/tablet-3ecc7bc3dcc049d0b7e9976c7971adde.intents
I0611 19:41:25.339751 1512 tablet.cc:863] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Successfully opened a RocksDB database at /root/var/data/yb-data/tserver/data/rocksdb/table-000033f1000030008000000000004003/tablet-3ecc7bc3dcc049d0b7e9976c7971adde, obj: 0x55c164a66000
T 3ecc7bc3dcc049d0b7e9976c7971adde
is the identifier of the tablet, and P db20a8214add45e69bd4f1e168772389
is the tablet peer. In this single-node cluster, the table starts with one tablet only, and one peer, the leader.
A simple insert: fast path
I insert a single row in this simple table:
yugabyte=# insert into demo values(42,'answer');
INSERT 0 1
I0611 19:43:10.814085 119 tablet.cc:1489] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Wrote 2 key/value pairs to kRegular RocksDB:
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [R]: 1. PutCF: SubDocKey(DocKey(0xbf4f, [42], []), [SystemColumnId(0); HT{ physical: 1686512590795474 }]) => null
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [R]: 2. PutCF: SubDocKey(DocKey(0xbf4f, [42], []), [ColumnId(1); HT{ physical: 1686512590795474 w: 1 }]) => "answer"
The most important information here is: Wrote 2 key/value pairs to kRegular RocksDB
. The two key/values are detailed in the next lines. Both have the key ([42]
with its hash value in front for sharding). One is for the row itself (SystemColumnId(0)
), and one is for the first non-key column (ColumnId(1)
) setting its value (=> "answer"
). They have both their MVCC version as Hybrid Time (HT
) with the epoch as its physical time.
SQL distributed transactions
This example was simple and fast. The SQL table row was inserted as a single key-value document. YugabyteDB has automatically used the "fast path" here. However, SQL transactions can be more complex than that:
- multiple rows, which can be on different nodes
- read other tables to check foreign keys
- update secondary indexes
- joins to additional tables
- mode statement in one transaction
- multi-statement transactions
First, in a complex transaction, the MVCC Hybrid Timestamp is not known before the end of the transaction (COMMIT time). This means that we cannot write directly the key/value as we have seen above. The transaction provisional records will be written to another structure of the same tablet, with a reference to a transaction table. At commit, the status will be set in the transaction table and the tablets will be able to apply the provisional records to the final structure.
Those structures are LSM-Tree: IntentsDB for the provisional records and RegularDB for the final ones. In the trace above the key/values were written directly to the kRegular RocksDB
and each operation was tagged with [R]
to identify the regular database.
YugabyteDB
Let's insert two rows in a single transaction:
yugabyte=# insert into demo values(1,'HitchHiker'),(2,'Restaurant');
INSERT 0 2
I0611 20:07:45.474311 3384 tablet.cc:1489] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Wrote 15 key/value pairs to kIntents RocksDB:
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 1. PutCF: TXN META 4c256c42-2ee1-4be1-8c8f-002e477c3280 => { transaction_id: 4c256c42-2ee1-4be1-8c8f-002e477c3280 isolation: SNAPSHOT_ISOLATION status_tablet: 5ec8daec64cd451395b8306ad26e4c80 priority: 5525364300130340406 start_time: { physical: 1686514065448922 } locality: GLOBAL old_status_tablet: external_transaction: 0}
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 2. PutCF: SubDocKey(DocKey(0xeda9, [1], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 } => TransactionId(4c256c42-2ee1-4be1-8c8f-002e477c3280) WriteId(0) null
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 3. PutCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 } => SubDocKey(DocKey(0xeda9, [1], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 4. PutCF: SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 1 } => TransactionId(4c256c42-2ee1-4be1-8c8f-002e477c3280) WriteId(1) "HitchHiker"
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 5. PutCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 1 } => SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 1 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 6. PutCF: SubDocKey(DocKey(0xcfaa, [2], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 2 } => TransactionId(4c256c42-2ee1-4be1-8c8f-002e477c3280) WriteId(2) null
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 7. PutCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 2 } => SubDocKey(DocKey(0xcfaa, [2], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 2 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 8. PutCF: SubDocKey(DocKey(0xcfaa, [2], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 3 } => TransactionId(4c256c42-2ee1-4be1-8c8f-002e477c3280) WriteId(3) "Restaurant"
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 9. PutCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 3 } => SubDocKey(DocKey(0xcfaa, [2], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 3 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 10. PutCF: SubDocKey(DocKey(0xcfaa, [2], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 4 } => TransactionId(4c256c42-2ee1-4be1-8c8f-002e477c3280) none
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 11. PutCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 4 } => SubDocKey(DocKey(0xcfaa, [2], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 4 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 12. PutCF: SubDocKey(DocKey(0xeda9, [1], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 5 } => TransactionId(4c256c42-2ee1-4be1-8c8f-002e477c3280) none
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 13. PutCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 5 } => SubDocKey(DocKey(0xeda9, [1], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 5 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 14. PutCF: SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 6 } => TransactionId(4c256c42-2ee1-4be1-8c8f-002e477c3280) none
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 15. PutCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 6 } => SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 6 }
I0611 20:07:45.476101 119 tablet.cc:1489] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Wrote 4 key/value pairs to kRegular RocksDB:
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [R]: 1. PutCF: SubDocKey(DocKey(0xeda9, [1], []), [SystemColumnId(0); HT{ physical: 1686514065475595 }]) => null; intent doc ht: HT{ physical: 1686514065465871 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [R]: 2. PutCF: SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1); HT{ physical: 1686514065475595 w: 1 }]) => "HitchHiker"; intent doc ht: HT{ physical: 1686514065465871 w: 1 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [R]: 3. PutCF: SubDocKey(DocKey(0xcfaa, [2], []), [SystemColumnId(0); HT{ physical: 1686514065475595 w: 2 }]) => null; intent doc ht: HT{ physical: 1686514065465871 w: 2 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [R]: 4. PutCF: SubDocKey(DocKey(0xcfaa, [2], []), [ColumnId(1); HT{ physical: 1686514065475595 w: 3 }]) => "Restaurant"; intent doc ht: HT{ physical: 1686514065465871 w: 3 }
I0611 20:07:45.476333 200 tablet.cc:1489] T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389: Wrote 15 key/value pairs to kIntents RocksDB:
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 1. SingleDeleteCF: TXN META 4c256c42-2ee1-4be1-8c8f-002e477c3280
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 2. SingleDeleteCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 3. SingleDeleteCF: SubDocKey(DocKey(0xeda9, [1], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 4. SingleDeleteCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 1 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 5. SingleDeleteCF: SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 1 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 6. SingleDeleteCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 2 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 7. SingleDeleteCF: SubDocKey(DocKey(0xcfaa, [2], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 2 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 8. SingleDeleteCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 3 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 9. SingleDeleteCF: SubDocKey(DocKey(0xcfaa, [2], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ physical: 1686514065465871 w: 3 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 10. SingleDeleteCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 4 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 11. SingleDeleteCF: SubDocKey(DocKey(0xcfaa, [2], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 4 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 12. SingleDeleteCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 5 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 13. SingleDeleteCF: SubDocKey(DocKey(0xeda9, [1], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 5 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 14. SingleDeleteCF: TXN REV 4c256c42-2ee1-4be1-8c8f-002e477c3280 HT{ physical: 1686514065465871 w: 6 }
T 3ecc7bc3dcc049d0b7e9976c7971adde P db20a8214add45e69bd4f1e168772389 [I]: 15. SingleDeleteCF: SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ physical: 1686514065465871 w: 6 }
There's a lot more here, with 3 batches of writes:
- Wrote 15 key/value pairs to kIntents RocksDB during the insert
- Wrote 4 key/value pairs to kRegular RocksDB after the commit
- Wrote 15 key/value pairs to kIntents RocksDB
Only the first one, to kIntents RocksDB
, the IntentsDB with provisional records, happens during the user command. Once the transaction is committed, all other sessions can read the provisional records and assess about their visibility from the transaction status and timestamp. This first batch writes 15 key/value pairs including the changes but also the transaction metadata and the reverse index (to find all changes related to one transaction). After the commit, each tablet will apply the provisional records to the RegularDB and this is 4 key/value pairs similar to what we have seen with the fast path: 1 per row and 1 per non-key column. The third batch is the cleanup of the 15 provisional records in the IntentsDB.
What are those 15 provisional records? They are the same key/value pairs that we have seen in the fast path, plus transaction control information.
-
1. PutCF: TXN META
is the transaction metadata, with its start time, isolation level, etc. -
2. PutCF: SubDocKey(DocKey(0xeda9, [1], []), [SystemColumnId(0)])
is the row entry, as we have seen in the fast path, with additional lock information ([kStrongRead, kStrongWrite]
for the intention to read and write) and thetransactionId
which references the metadata above -
3. PutCF: TXN REV
adds the previous operation to the reverse index.
We will have the same pattern for all changes so that each operation references the transaction and the transaction has a list of operations.
To summarize...
What we call the "fast-path" has the following characteristics:
- is a single-shard transaction (as opposed to distributed transaction)
- concerns single-row transactions. In future versions, we may do the same for multi-row transactions that are single-shard.
- is transparent, detected automatically (when there's only one row)
- bypasses IntentsDB completely (the provisional records and the final cleanup)
- doesn't need to update a transaction table (the write's raft commit is the same as the transaction's SQL commit)
- is close to NoSQL performance (as NoSQL gets this performance by refusing any feature that would require multi-shard operations)
By tracing the writes in a lab, we can see that the distributed transactions do the same but with intermediate states and more work. In short, YugabyteDB provides all SQL features but when the transactions are as simple as NoSQL ones, it can provide the same performance as NoSQL, transparently, and still on a relational or document schema.