Online Rolling Upgrade in YugabyteDB Managed

Franck Pachot - Sep 24 '23 - Dev Community

Monolithic databases must be stopped when their servers require maintenance, such as OS or database upgrades. They stay down for the whole duration of the maintenance or, at the very least, until the switchover to a standby replica completes. Either way, the application experiences downtime.

YugabyteDB demonstrates resilience when nodes are halted, enabling seamless rolling upgrades without interrupting the application. Nonetheless, it remains advisable to establish a maintenance window, preferably during periods of lower activity.

One of my YugabyteDB Managed clusters requires an OS upgrade.

Upcoming Maintenance
Maintenance is scheduled for tomorrow, but I have the option to perform the upgrade immediately.
Maintenance Window

Before clicking the "Upgrade Now" button, I started a loop of 30-second pgbench runs to show how the upgrade affects a running application. pgbench connects through the load balancer.

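# Run 30-second pgbench tests back to back (connection settings come from the environment).
# -n: skip vacuum, -N: built-in "simple update" script, -c 10: ten clients,
# --max-tries=10: retry a transaction up to 10 times on serialization or deadlock errors.
while true
do
  pgbench -nN -c 10 -T30 --max-tries=10
done 2>&1 | ts | tee bench.log   # ts (from moreutils) prefixes each line with a timestamp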

Sep 24 20:37:40 pgbench (16.0, server 11.2-YB-2.18.2.1-b0)
Sep 24 20:38:10 transaction type: <builtin: simple update>
Sep 24 20:38:10 scaling factor: 1
Sep 24 20:38:10 query mode: simple
Sep 24 20:38:10 number of clients: 10
Sep 24 20:38:10 number of threads: 1
Sep 24 20:38:10 maximum number of tries: 10
Sep 24 20:38:10 duration: 30 s
Sep 24 20:38:10 number of transactions actually processed: 1594
Sep 24 20:38:10 number of failed transactions: 0 (0.000%)
Sep 24 20:38:10 number of transactions retried: 0 (0.000%)
Sep 24 20:38:10 total number of retries: 0
Sep 24 20:38:10 latency average = 170.637 ms
Sep 24 20:38:10 initial connection time = 2938.506 ms
Sep 24 20:38:10 tps = 58.604070 (without initial connection time)
Sep 24 20:38:10 pgbench (16.0, server 11.2-YB-2.18.2.1-b0)
Sep 24 20:38:40 transaction type: <builtin: simple update>
Sep 24 20:38:40 scaling factor: 1
Sep 24 20:38:40 query mode: simple
Sep 24 20:38:40 number of clients: 10
Sep 24 20:38:40 number of threads: 1
Sep 24 20:38:40 maximum number of tries: 10
Sep 24 20:38:40 duration: 30 s
Sep 24 20:38:40 number of transactions actually processed: 1572
Sep 24 20:38:40 number of failed transactions: 0 (0.000%)
Sep 24 20:38:40 number of transactions retried: 0 (0.000%)
Sep 24 20:38:40 total number of retries: 0
Sep 24 20:38:40 latency average = 172.880 ms
Sep 24 20:38:40 initial connection time = 2978.176 ms
Sep 24 20:38:40 tps = 57.843725 (without initial connection time)
Sep 24 20:38:41 pgbench (16.0, server 11.2-YB-2.18.2.1-b0)
...
Sep 24 20:39:11 tps = 59.212379 (without initial connection time)
Sep 24 20:39:12 pgbench (16.0, server 11.2-YB-2.18.2.1-b0)
...
Sep 24 20:39:42 tps = 58.289579 (without initial connection time)
Sep 24 20:39:42 pgbench (16.0, server 11.2-YB-2.18.2.1-b0)
...
Sep 24 20:40:13 tps = 59.049061 (without initial connection time)
Sep 24 20:40:13 pgbench (16.0, server 11.2-YB-2.18.2.1-b0)
...

Each 30-second run processes approximately 1,600 simple update transactions (1594 and 1572 in the first two runs above). Note that latency is higher than it would otherwise be because pgbench runs from a different region. I will examine these metrics in the YugabyteDB Managed performance console.
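
As a rough cross-check of the pgbench summary above: the reported tps excludes the initial connection time, so the number of transactions in a run is approximately tps × (duration − initial connection time). The sketch below applies this to the first run's figures. `estimate_transactions` is a name introduced here for illustration, and the result is only an estimate because a run does not last exactly 30 seconds.

```bash
# Estimate the transactions processed in one run from pgbench's summary lines.
# estimate_transactions is a hypothetical helper, not part of pgbench.
estimate_transactions() {
  local tps=$1 duration_s=$2 connect_ms=$3
  awk -v tps="$tps" -v d="$duration_s" -v c="$connect_ms" \
    'BEGIN { printf "%.0f\n", tps * (d - c / 1000) }'
}

# First run above: tps = 58.604070, duration 30 s, initial connection time 2938.506 ms
estimate_transactions 58.604070 30 2938.506
```

This prints 1586, close to the 1594 transactions that pgbench reported for that run.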

Now that it's running smoothly, I've initiated the rolling upgrade. The cluster is marked as under maintenance, but the application continues to operate without interruption.

Here is the log summary after one hour:

$ grep -iE "(tps|retried: [^0]|failed transactions: [^0]|Run was aborted)" bench.log |
  awk '/tps =/{split($3,a,":");t=a[1]*60*60+a[2]*60+a[3];print $0,"+" (t-l) "s";l=t;next}{print}'

Sep 24 20:38:10 tps = 58.604070 (without initial connection time) +74290s
Sep 24 20:38:40 tps = 57.843725 (without initial connection time) +30s
Sep 24 20:39:11 tps = 59.212379 (without initial connection time) +31s
Sep 24 20:39:42 tps = 58.289579 (without initial connection time) +31s
Sep 24 20:40:13 tps = 59.049061 (without initial connection time) +31s
Sep 24 20:40:43 tps = 59.047528 (without initial connection time) +30s
Sep 24 20:41:14 tps = 58.602303 (without initial connection time) +31s
Sep 24 20:41:45 tps = 58.824340 (without initial connection time) +31s
Sep 24 20:42:16 tps = 58.076740 (without initial connection time) +31s
Sep 24 20:42:46 tps = 57.492215 (without initial connection time) +30s
Sep 24 20:43:17 tps = 57.959574 (without initial connection time) +31s
Sep 24 20:43:48 tps = 57.447093 (without initial connection time) +31s
Sep 24 20:44:18 tps = 59.937788 (without initial connection time) +30s
Sep 24 20:44:49 tps = 58.436465 (without initial connection time) +31s
Sep 24 20:45:20 tps = 58.640678 (without initial connection time) +31s
Sep 24 20:45:51 tps = 57.784915 (without initial connection time) +31s
Sep 24 20:46:21 tps = 58.788018 (without initial connection time) +30s
Sep 24 20:46:52 tps = 56.981464 (without initial connection time) +31s
Sep 24 20:47:22 tps = 59.139312 (without initial connection time) +30s
Sep 24 20:47:53 tps = 58.454658 (without initial connection time) +31s
Sep 24 20:48:24 tps = 58.747788 (without initial connection time) +31s
Sep 24 20:48:55 tps = 58.975338 (without initial connection time) +31s
Sep 24 20:49:25 tps = 58.513081 (without initial connection time) +30s
Sep 24 20:49:56 tps = 56.849241 (without initial connection time) +31s
Sep 24 20:50:27 tps = 58.688325 (without initial connection time) +31s
Sep 24 20:50:57 tps = 58.651095 (without initial connection time) +30s
Sep 24 20:51:28 tps = 57.530267 (without initial connection time) +31s
Sep 24 20:51:59 tps = 57.794496 (without initial connection time) +31s
Sep 24 20:52:29 tps = 58.402662 (without initial connection time) +30s
Sep 24 20:53:00 tps = 58.234007 (without initial connection time) +31s
Sep 24 20:53:31 tps = 58.419120 (without initial connection time) +31s
Sep 24 20:54:01 tps = 58.470708 (without initial connection time) +30s
Sep 24 20:54:32 tps = 58.948411 (without initial connection time) +31s
Sep 24 20:55:03 tps = 58.492230 (without initial connection time) +31s
Sep 24 20:55:33 tps = 59.828750 (without initial connection time) +30s
Sep 24 20:56:04 tps = 59.083429 (without initial connection time) +31s
Sep 24 20:56:35 tps = 57.327188 (without initial connection time) +31s
Sep 24 20:57:06 tps = 58.605542 (without initial connection time) +31s
Sep 24 20:57:36 tps = 58.210278 (without initial connection time) +30s
Sep 24 20:58:07 tps = 56.909389 (without initial connection time) +31s
Sep 24 20:58:38 tps = 57.688800 (without initial connection time) +31s
Sep 24 20:59:08 number of transactions retried: 2 (0.128%)
Sep 24 20:59:08 tps = 57.394979 (without initial connection time) +30s
Sep 24 20:59:39 tps = 59.304493 (without initial connection time) +31s
Sep 24 21:00:10 tps = 59.260027 (without initial connection time) +31s
Sep 24 21:00:40 tps = 58.112145 (without initial connection time) +30s
Sep 24 21:01:11 tps = 57.691825 (without initial connection time) +31s
Sep 24 21:01:42 tps = 57.571573 (without initial connection time) +31s
Sep 24 21:02:12 tps = 57.518545 (without initial connection time) +30s
Sep 24 21:02:43 tps = 58.940469 (without initial connection time) +31s
Sep 24 21:03:14 tps = 57.600011 (without initial connection time) +31s
Sep 24 21:03:44 tps = 58.570001 (without initial connection time) +30s
Sep 24 21:04:15 tps = 58.695916 (without initial connection time) +31s
Sep 24 21:04:46 tps = 57.372924 (without initial connection time) +31s
Sep 24 21:05:17 number of transactions retried: 1 (0.063%)
Sep 24 21:05:17 tps = 58.215367 (without initial connection time) +31s
Sep 24 21:05:47 tps = 57.362539 (without initial connection time) +30s
Sep 24 21:06:18 tps = 58.101503 (without initial connection time) +31s
Sep 24 21:06:49 tps = 57.503870 (without initial connection time) +31s
Sep 24 21:07:19 tps = 57.265190 (without initial connection time) +30s
Sep 24 21:07:50 tps = 57.867751 (without initial connection time) +31s
Sep 24 21:08:21 tps = 58.610872 (without initial connection time) +31s
Sep 24 21:08:51 tps = 57.746011 (without initial connection time) +30s
Sep 24 21:09:22 tps = 59.192024 (without initial connection time) +31s
Sep 24 21:09:53 tps = 58.271083 (without initial connection time) +31s
Sep 24 21:10:23 tps = 58.470143 (without initial connection time) +30s
Sep 24 21:12:41 tps = 49.647477 (without initial connection time) +138s
Sep 24 21:13:11 tps = 57.092438 (without initial connection time) +30s
Sep 24 21:13:42 tps = 57.919045 (without initial connection time) +31s
Sep 24 21:14:13 tps = 58.306535 (without initial connection time) +31s
Sep 24 21:14:43 tps = 59.710009 (without initial connection time) +30s
Sep 24 21:15:14 tps = 58.035985 (without initial connection time) +31s
Sep 24 21:15:45 tps = 58.663706 (without initial connection time) +31s
Sep 24 21:16:15 tps = 58.580870 (without initial connection time) +30s
Sep 24 21:16:46 tps = 59.161974 (without initial connection time) +31s
Sep 24 21:17:17 tps = 57.408015 (without initial connection time) +31s
Sep 24 21:17:48 tps = 58.366285 (without initial connection time) +31s
Sep 24 21:18:18 tps = 58.474782 (without initial connection time) +30s
Sep 24 21:18:49 number of transactions retried: 2 (0.126%)
Sep 24 21:18:49 tps = 58.559238 (without initial connection time) +31s
Sep 24 21:19:19 tps = 58.346393 (without initial connection time) +30s
Sep 24 21:19:55 tps = 45.906261 (without initial connection time) +36s
Sep 24 21:19:55 pgbench: error: Run was aborted; the above results are incomplete.
Sep 24 21:20:25 tps = 56.898508 (without initial connection time) +30s
Sep 24 21:21:13 tps = 58.585001 (without initial connection time) +48s
Sep 24 21:21:43 tps = 58.169470 (without initial connection time) +30s
Sep 24 21:22:14 tps = 58.434046 (without initial connection time) +31s
Sep 24 21:22:44 tps = 58.148500 (without initial connection time) +30s
Sep 24 21:23:15 tps = 58.476839 (without initial connection time) +31s
Sep 24 21:23:46 tps = 57.943592 (without initial connection time) +31s
Sep 24 21:24:16 tps = 58.794621 (without initial connection time) +30s
Sep 24 21:24:47 tps = 59.068495 (without initial connection time) +31s
Sep 24 21:25:18 number of transactions retried: 2 (0.125%)
Sep 24 21:25:18 tps = 58.908909 (without initial connection time) +31s
Sep 24 21:28:03 tps = 59.011434 (without initial connection time) +165s
Sep 24 21:28:34 tps = 57.238880 (without initial connection time) +31s
Sep 24 21:29:10 tps = 47.188028 (without initial connection time) +36s
Sep 24 21:29:10 pgbench: error: Run was aborted; the above results are incomplete.
Sep 24 21:29:40 tps = 58.037674 (without initial connection time) +30s
Sep 24 21:30:56 tps = 57.695296 (without initial connection time) +76s
Sep 24 21:31:26 tps = 57.334611 (without initial connection time) +30s
Sep 24 21:31:57 tps = 58.310055 (without initial connection time) +31s
Sep 24 21:32:28 tps = 58.035667 (without initial connection time) +31s
Sep 24 21:32:58 tps = 58.041655 (without initial connection time) +30s
Sep 24 21:33:29 tps = 58.876641 (without initial connection time) +31s
Sep 24 21:34:00 tps = 57.118299 (without initial connection time) +31s
Sep 24 21:34:31 tps = 59.302220 (without initial connection time) +31s
Sep 24 21:35:01 tps = 58.572992 (without initial connection time) +30s
Sep 24 21:35:32 tps = 58.132395 (without initial connection time) +31s
Sep 24 21:36:03 tps = 58.604205 (without initial connection time) +31s
Sep 24 21:36:33 number of transactions retried: 3 (0.191%)
Sep 24 21:36:33 tps = 58.207134 (without initial connection time) +30s
Sep 24 21:37:04 tps = 58.337062 (without initial connection time) +31s
Sep 24 21:37:35 tps = 59.832816 (without initial connection time) +31s
Sep 24 21:38:05 tps = 59.779604 (without initial connection time) +30s
Sep 24 21:38:36 tps = 57.373025 (without initial connection time) +31s
Sep 24 21:39:07 tps = 58.817661 (without initial connection time) +31s
Sep 24 21:39:37 tps = 59.644602 (without initial connection time) +30s
Sep 24 21:40:08 tps = 57.635763 (without initial connection time) +31s
Sep 24 21:40:39 number of transactions retried: 1 (0.062%)
Sep 24 21:40:39 tps = 59.027850 (without initial connection time) +31s
Sep 24 21:41:10 tps = 58.905962 (without initial connection time) +31s
Sep 24 21:41:40 tps = 57.701086 (without initial connection time) +30s
Sep 24 21:42:11 tps = 59.091197 (without initial connection time) +31s
Sep 24 21:42:42 tps = 58.465723 (without initial connection time) +31s
Sep 24 21:43:12 tps = 58.124405 (without initial connection time) +30s
Sep 24 21:43:43 tps = 58.371284 (without initial connection time) +31s
Sep 24 21:44:14 tps = 58.187001 (without initial connection time) +31s
Sep 24 21:44:45 tps = 59.529472 (without initial connection time) +31s
Sep 24 21:45:15 tps = 59.222210 (without initial connection time) +30s
Sep 24 21:45:46 tps = 57.402896 (without initial connection time) +31s
Sep 24 21:46:17 tps = 58.906421 (without initial connection time) +31s
Sep 24 21:46:47 tps = 58.996497 (without initial connection time) +30s
Sep 24 21:47:18 tps = 58.465572 (without initial connection time) +31s
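
A note on reading this output: the `+74290s` on the first line is not a real gap. The awk variable holding the previous timestamp is still unset (zero) at that point, so the first delta is simply the number of seconds since midnight (20:38:10 is 74290 seconds). A possible variant that prints no delta for the first line is sketched below. `tps_deltas` is a name chosen for illustration, and like the one-liner above it assumes that all timestamps fall on the same day.

```bash
# Print each "tps =" line of a ts-prefixed pgbench log with the seconds elapsed
# since the previous "tps =" line. The first such line has no predecessor,
# so it is printed without a delta. Other lines pass through unchanged.
tps_deltas() {
  awk '
    /tps =/ {
      split($3, hms, ":")                          # $3 is the HH:MM:SS field added by ts
      now = hms[1] * 3600 + hms[2] * 60 + hms[3]   # seconds since midnight
      if (seen) print $0, "+" (now - prev) "s"
      else      print $0
      prev = now; seen = 1
      next
    }
    { print }
  '
}

# Sample input copied from the first lines of the log above
tps_deltas <<'EOF'
Sep 24 20:38:10 tps = 58.604070 (without initial connection time)
Sep 24 20:38:40 tps = 57.843725 (without initial connection time)
Sep 24 20:39:11 tps = 59.212379 (without initial connection time)
EOF
```

With the real log, the function would replace the awk stage of the pipeline above: `grep -iE "..." bench.log | tps_deltas`.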

Note that I'm running the least resilient configuration here, as my cluster has only 3 nodes, which is the minimum for a Replication Factor 3 cluster. When a node is taken down, one-third of the connections fail and one-third of the Raft leaders have to step down. Production clusters usually run with more nodes, so the impact of a rolling upgrade is lower.
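
To put numbers on this, here is a small sketch of the share of connections and Raft leaders hosted on a single node. It assumes both are spread evenly across the nodes, which is a simplification for illustration, and `share_per_node` is a hypothetical helper name.

```bash
# Percentage of connections (and Raft leaders) hosted on one node,
# assuming an even spread across all nodes (a simplifying assumption).
share_per_node() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", 100 / n }'
}

for nodes in 3 6 9; do
  echo "$nodes nodes: $(share_per_node "$nodes")% affected when one node restarts"
done
```

Under that assumption, going from 3 to 6 nodes halves the share of the workload touched by each node restart, from 33.3% to 16.7%.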

In the logs, we can observe three brief periods of decreased throughput, with pgbench aborting and restarting, at 21:12, 21:19, and 21:29. The issue here is that my basic while true loop with pgbench has no connection pool, and failed connections are not automatically re-established. Consequently, there were many "Connection refused" errors because the load balancer continued to direct traffic to a node that was temporarily down. To increase availability, it's advisable to use a connection pool and smart drivers in your application.
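
The kind of reconnect logic that a pool or driver provides can be illustrated with a generic retry helper. This is only a sketch under stated assumptions, not what was run for this test and not a substitute for a connection pool or a smart driver: `retry_with_backoff` and `flaky_connect` are hypothetical names, and `flaky_connect` merely stands in for a connection attempt that is refused twice before succeeding.

```bash
# Run a command until it succeeds, doubling the wait after each failure.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))          # exponential backoff
    attempt=$((attempt + 1))
  done
}

# Stand-in for a connection attempt: fails on the first two calls, then succeeds.
calls=0
flaky_connect() { calls=$((calls + 1)); [ "$calls" -ge 3 ]; }

# Initial delay of 0 keeps the demonstration instant; a real caller would use 1 or more.
retry_with_backoff 5 0 flaky_connect && echo "connected after $calls attempts"
```

In an application, the command being retried would be the step that obtains a fresh connection and reruns the failed transaction, which is what lets a client move away from a node that has just been stopped.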

Those periods are clearly discernible in the performance graphs, characterized by a reduction in SQL operations per second. This decline aligns with the moments when pgbench faced disruptions and aborted during its execution:
ops

During those periods, latency increased, primarily due to internal retries of read and write operations within the system:
lat

It's quite straightforward to identify when the nodes were down by examining the connection limit, which is set to 60 per node. The windows during which a node was down correspond to the moments when the total limit dropped from 180 to 120 connections:
con

As evident from the performance metrics, latency glitches occur for only a brief duration, even without any special considerations in the application configuration, using a simple loop of pgbench runs, on a cluster with the minimum number of nodes.
