This blogpost is about the investigation of a warning message saying Fsync took a long time
that went away in recent versions of YugabyteDB.
Summary
(spoiler alert!) The short version is that we still perform fsync()
, because that is the guaranteed way for any application to tell the operating system and the hardware to persist all the dirty blocks/extents for a given inode. However, performing fsync()
on an inode is now delegated to a dedicated thread (the log-sync
thread), and only if this thread cannot perform its work fast enough performing fsync()
, it will fallback and be performed by the original thread that performs the logging called the append
thread.
The logging of fsync()
when it exceeds the set threshold of 50 milliseconds still happens, however in the log-sync
thread the severity has been changed from WARNING to INFO, which means that if you filter on the WARNING severity you don't see it. The INFO level messages can become quite a lot, which is why it makes sense to filter it. Also, the message additionally is throttled to be shown at a maximum rate of 1 per second.
Past situation
When a batch load is performed, and I filter for took a long time
, this is what I see with YugabyteDB version 2.13.2.0:
➜ yb_stats --tail-log --hostname-match 9000 --stat-name-match 'took a long time'
192.168.66.82:9000 2023-02-09 13:31:59.697006 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.060s user 0.001s sys 0.000s
192.168.66.80:9000 2023-02-09 13:32:04.205387 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.051s user 0.000s sys 0.001s
192.168.66.80:9000 2023-02-09 13:32:08.263603 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.057s user 0.001s sys 0.001s
192.168.66.81:9000 2023-02-09 13:32:08.980399 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.079s user 0.000s sys 0.000s
192.168.66.82:9000 2023-02-09 13:32:09.025662 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.107s user 0.000s sys 0.001s
192.168.66.80:9000 2023-02-09 13:32:09.323518 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.206s user 0.001s sys 0.000s
192.168.66.81:9000 2023-02-09 13:32:09.325979 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.065s user 0.000s sys 0.000s
192.168.66.80:9000 2023-02-09 13:32:09.336320 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.209s user 0.000s sys 0.000s
192.168.66.82:9000 2023-02-09 13:32:09.336520 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.174s user 0.000s sys 0.002s
192.168.66.81:9000 2023-02-09 13:32:11.481073 UTC W log.cc:961 Time spent Fsync log took a long time: real 0.051s user 0.000s sys 0.001s
Current situation
When a batch load is performed with a recent version, and I filter for took a long time
(I am using 2.17.1.0), I see some message for log roll
:
❯ yb_stats --tail-log --hostname-match 9000 --stat-name-match 'took a long time'
Tail log ready, showing severities: WEF
192.168.66.80:9000 2023-02-09 14:07:32.139693 UTC W log.cc:731 Time spent T 08372aa6c40c4a0c89dd3692f0b09563 P 1f8a43d3668d4235a990e6045d4f59db: Log roll took a long time: real 0.051s user 0.000s sys 0.000s
192.168.66.82:9000 2023-02-09 14:07:32.147421 UTC W log.cc:731 Time spent T 58d3266d12b0427c9a893c4529506d43 P cea6e8a0388749d18ea9496401608c4e: Log roll took a long time: real 0.053s user 0.001s sys 0.001s
192.168.66.82:9000 2023-02-09 14:07:32.157840 UTC W log.cc:731 Time spent T 08372aa6c40c4a0c89dd3692f0b09563 P cea6e8a0388749d18ea9496401608c4e: Log roll took a long time: real 0.052s user 0.000s sys 0.001s
So, the first thing to look for is whether we are still doing fsync()
or not.
Finding the fsync() call again
Seeing if the fsync()
system call is still performed is not very hard. Traditionally looking at system calls is done using the linux strace
command, or the equivalents on the other operating systems. Today, this can be done too by perf, using: perf trace
Using it is as simple as executing:
$ perf trace -e fsync
And running a batch load for which a tablet is touched that has its leader on the local machine on which the perf trace
execute is run. This is how that looks like:
$ sudo perf trace -e fsync
0.000 ( 3.612 ms): append [worker/8859 fsync(fd: 124) = 0
17.115 ( 1.835 ms): append [worker/7164 fsync(fd: 85) = 0
64.119 (14.810 ms): append [worker/8858 fsync(fd: 282) = 0
82.983 ( 7.721 ms): append [worker/8870 fsync(fd: 283) = 0
104.238 (17.324 ms): append [worker/8871 fsync(fd: 298) = 0
269.817 ( 6.451 ms): log-sync [work/7162 fsync(fd: 282) = 0
300.490 (12.274 ms): log-sync [work/7162 fsync(fd: 283) = 0
326.252 (10.503 ms): log-sync [work/7162 fsync(fd: 298) = 0
486.770 (10.359 ms): log-sync [work/7162 fsync(fd: 282) = 0
523.031 (17.058 ms): log-sync [work/7162 fsync(fd: 298) = 0
552.080 (29.121 ms): log-sync [work/7162 fsync(fd: 283) = 0
762.369 ( ): log-sync [work/7162 fsync(fd: 283) ...
769.427 ( ): log-sync [work/8873 fsync(fd: 298) ...
762.369 (54.809 ms): log-sync [work/7162 ... [continued]: fsync()) = 0
769.427 (49.362 ms): log-sync [work/8873 ... [continued]: fsync()) = 0
779.805 (45.496 ms): log-sync [work/8874 fsync(fd: 282) = 0
Simple conclusion is we are still performing fsync()
, that becomes very much clear from the perf trace
output. If you look at the latency times, it's also visible that on this system (a tiny test system) the latency still can exceed the threshold of 50ms. So why don't I see a message indicating this?
Digging deeper: understanding where the fsync happens
To try to find out more, we need to see what is going on in the code. This is possible, because YugabyteDB is open source. However, we need to figure out where to look.
There are several ways of doing this. One way that is reasonably simple to do, is to let perf trace
print a backtrace/stacktrace with each occurrence of the system call. Once we
17865.004 (16.523 ms): log-sync [work/9892 fsync(fd: 309) = 0
__fsync_nocancel (inlined)
yb::(anonymous namespace)::PosixWritableFile::Sync (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
yb::log::Log::DoSync (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
yb::log::Log::DoSyncAndResetTaskInQueue (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
yb::ThreadPool::DispatchThread (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
yb::Thread::SuperviseThread (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
start_thread (/opt/yugabyte/yugabyte-2.17.1.0/linuxbrew/lib/libpthread-2.23.so)
__clone (/opt/yugabyte/yugabyte-2.17.1.0/linuxbrew/lib/libc-2.23.so)
The function DoSyncAndResetTaskInQueue
can be found here, which calls DoSync
, for which the relevant part is this:
LOG_SLOW_EXECUTION_EVERY_N_SECS(INFO, /* log at most one slow execution every 1 sec */ 1,
50, "Fsync log took a long time") {
SCOPED_LATENCY_METRIC(metrics_, sync_latency);
status = active_segment_->Sync();
}
A lot of things to see and say here:
-
LOG_SLOW_EXECUTION_EVERY_N_SECS
: this means it's not actually logging every occurrence, but rather 'EVERY_N_SECS', which is set to 1 here. -
INFO
: Ah! The log level has been set from WARNING to INFO for the message. -
50
: the threshold is still 50ms, this didn't change from the previous fsync that happened inside the critical path. -
status = active_segment_->Sync()
: this is the actual call to fsync.
Displaying the Fsync took a long time messages again
With what we've learned from visiting the code, it's rather easy now to display the Fsync took a long time
again:
➜ yb_stats --tail-log --hostname-match 9000 --log-severity IWEF --stat-name-match 'took a long time'
Tail log ready, showing severities: IWEF
192.168.66.80:9000 2023-02-09 15:24:25.850107 UTC W log.cc:731 Time spent T 58d3266d12b0427c9a893c4529506d43 P 1f8a43d3668d4235a990e6045d4f59db: Log roll took a long time: real 0.057s user 0.000s sys 0.001s
192.168.66.80:9000 2023-02-09 15:24:25.931238 UTC I log.cc:1001 Time spent Fsync log took a long time: real 0.061s user 0.000s sys 0.002s
192.168.66.80:9000 2023-02-09 15:24:36.632426 UTC I log.cc:1001 Time spent Fsync log took a long time: real 0.065s user 0.000s sys 0.001s
192.168.66.81:9000 2023-02-09 15:24:39.925721 UTC I log.cc:1001 Time spent Fsync log took a long time: real 0.055s user 0.000s sys 0.002s
192.168.66.81:9000 2023-02-09 15:24:45.339032 UTC W log.cc:731 Time spent T 58d3266d12b0427c9a893c4529506d43 P c4ba4bb2cea04a2eade78bed94406fb9: Log roll took a long time: real 0.052s user 0.000s sys 0.001s
It's simply adding the I
to the log severities to be displayed. However, there is another thing that we've learned from reading the code, which is that not all fsync()
calls that took more than 50ms are displayed. So there could be more, or not.