With the YugabyteDB database, we persist data by guaranteeing the data changes, such inserts, updates, deletes, etc. to be persisted. As with many databases and other programs that have this requirement, we do this by writing it to a file and call fsync()
.
However, there is a thing that I wondered about for a long time: if I calculate the time spent on the fsync()
call when doing batch oriented work with YugabyteDB, the time spent waiting is often 20ms or more.
This time looks high. And you have to wait for fsync()
, because only if it finishes successfully, the data is guaranteed to be stored on persistent media.
Let's follow the work that is done down into the linux kernel, and see if we can find a reason!
The first step is to see what exactly we do in YugabyteDB. For normal write calls, for which we use the write()
system call, this is problematic because the Yugabyte database is a threaded application, and we use write()
on many occasions, which would give lots and lots of data.
But for writing to the YugabyteDB WAL files, we use the writev()
(vector write) system call. The writev()
alias vector write call allows multiple requests to be submitted in one go synchronously. Below we see the vector writes submitting two write requests each.
So to see the WAL writing and fsync activity, we can use the strace
utility and only show the writev
and fsync()
system calls. This is how that looks like:
$ strace -fp $(pgrep tserver) -e writev,fsync
...
[pid 1995] writev(121, [{iov_base="\20\0\0\0\235+\346\235:\0\222\357", iov_len=12}, {iov_base="\22\5\10\2\20\226U\30\260\261\277\311\242\377\2042", iov_len=16}], 2) = 28
[pid 1995] writev(121, [{iov_base="g\314\30\0\320v\332\237\244)\304\177", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\227U\21\0\2006\373\232\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid 1995] fsync(121) = 0
[pid 1995] writev(121, [{iov_base="\20\0\0\0\337\21N\212\315\357C\300", iov_len=12}, {iov_base="\22\5\10\2\20\227U\30\225\231\350\341\242\377\2042", iov_len=16}], 2) = 28
[pid 1995] writev(121, [{iov_base="g\314\30\0\327]?\33\377\36\3446", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\230U\21\0\320I\10\233\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid 1995] fsync(121) = 0
[pid 1995] writev(121, [{iov_base="\20\0\0\0\221\f\260\351\367h_\350", iov_len=12}, {iov_base="\22\5\10\2\20\230U\30\251\245\371\375\242\377\2042", iov_len=16}], 2) = 28
[pid 1995] writev(121, [{iov_base="g\314\30\0\26\234\252W\304g\262n", iov_len=12}, {iov_base="\n\323\230c\10\1\22\315\230c\n\5\10\2\20\231U\21\0p\7\24\233\250\311_H\0 \3*\256"..., iov_len=1625191}], 2) = 1625203
[pid 1995] fsync(121) = 0
...
So -in general, not strictly!-, during a batch operation such as a batch insert, there's two writev()
calls, one small call, about 28 bytes (the last number of the total size of the writev() request), and one big, which is 1625203 bytes.
This is where things start to get fuzzy, at least for me. The big write that is performed is seems to be 1625203 bytes, which is a little more than 1.5 MB. The YugabyteDB threshold for performing fsync is 1 second or 1 MB, but this is approximately 50% more?
But still, this is still reasonably known territory, right? We write data buffered, and then perform fsync(), there is no reason that that would take 20ms....or is it?
The next layer to look at is the block IO layer, or "bio" layer. To look at this layer, there are some brilliant tools in the bcc tools package, which use linux eBPF.
The tool to use here is biosnoop
: a tool to 'snoop' on the BIO layer.
This is how that looks like for the same requests for which we looked at with strace
:
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
15.492545 append [worker 1668 nvme1n1 W 107830376 4096 0.69
15.493980 append [worker 1668 nvme1n1 W 107830384 262144 2.08
15.495007 append [worker 1668 nvme1n1 W 107830896 262144 3.08
15.495918 append [worker 1668 nvme1n1 W 107831408 262144 3.95
15.497194 append [worker 1668 nvme1n1 W 107831920 262144 5.19
15.497705 append [worker 1668 nvme1n1 W 107832432 262144 5.67
15.499456 append [worker 1668 nvme1n1 W 107832944 262144 7.39
15.500215 append [worker 1668 nvme1n1 W 107833456 53248 8.14
15.507690 append [worker 1668 nvme1n1 W 104958050 1024 0.77
15.512486 append [worker 1668 nvme1n1 W 107833552 4096 0.59
15.514076 append [worker 1668 nvme1n1 W 107833560 262144 2.14
15.515213 append [worker 1668 nvme1n1 W 107834072 262144 3.24
15.515951 append [worker 1668 nvme1n1 W 107834584 262144 3.95
15.516999 append [worker 1668 nvme1n1 W 107835608 200704 4.90
15.517053 append [worker 1668 nvme1n1 W 107835096 262144 5.01
15.517995 append [worker 1668 nvme1n1 W 107836512 110592 5.85
15.518299 append [worker 1668 nvme1n1 W 107836000 262144 6.17
15.529624 append [worker 1668 nvme1n1 W 104958052 4096 0.62
This snippet from biosnoop
shows the requests as the block IO layer sees it. Above we see the result of two times the combination of writev()/writev()/fsync().
But there are many more requests here? If we carefully look, we see that the first request is a request of 4096 bytes. That is probably the first writev(), although bigger.
But the important thing to see are the requests that follow that, each with a size of 262144 bytes, alias 256 kB.
Each of the bigger requests of 256k shows a latency of approximately 3ms or higher. This is the average IO latency time that I recognise, but you can also clearly see here that smaller requests get a better latency, and thus bigger ones have worse latency.
But where comes the 262144 / 256k come from?
The answer to that comes with the linux kernel files that are part of the disk subsystem and drivers in /sys/block/<blockdevice>/queue
. This is a meta-directory, and contains information and settings about the disk.
The thing to look for is max_sectors_kb
:
# cat max_sectors_kb
256
This means the maximum IO size is 256 kB! So the kernel broke down the large requests in the biggest size it is set to handle, which here is 256 kB!
Let's see if we can increase the IO size: the maximum allowed size is specified in max_hw_sectors_kb
:
# cat max_hw_sectors_kb
256
Nope. We can't increase the IO size, it's already at its maximum. :-(
In case you're wondering what I run on, this is Amazon EC2 c5.
So now combining/going back to the requests that we measured using biosnoop
: these are the requests from the large writev()
request, but are broken down to 256kB chunks because of the maximum IO size, and then submitted to the block device.
Each of these requests takes approximately 3ms or higher, and there seems to be a tendency for successive large IO requests to get increasing latency.
Overall, this explains the fsync() latency that we see, these are the result of the chunks of 256kB that the fsync() call has to dispatch and wait for, which adds up to the 20ms or higher that we see with batches, when we fsync after (more than) 1M.