YugabyteDB advanced analysis: nonmetric diff

Frits Hoogland - Jan 31 '23 - Dev Community

One of the most important things a database administrator needs to do is understand what is going on and what has changed. yb_stats traditionally helped to fetch metrics from the metric endpoints to understand what is going on from the perspective of the metrics.

However, yb_stats can now also read the status and configuration of the principal YugabyteDB cluster components (the masters and the tablet servers), and show how these differ from an earlier point in time. This works both ad hoc (online) and using snapshots.

What does that mean? Let me show some examples:

Master leader change

A typical cluster has a replication factor of 3, which means there are 3 masters, of which one is LEADER and two are FOLLOWER.

This can be seen in the web UI on any master, or using yb_stats --print-masters:

➜ yb_stats --print-masters
c5d0f3ff5e824bb7b637ee16b3a3aa63 FOLLOWER Placement: local.local.local
                                 Seqno: 1675087212541781 Start time: 1675087212541781
                                 RPC addresses: ( yb-1.local:7100 )
                                 HTTP addresses: ( yb-1.local:7000 )
6bd3f8363cbc4ce69b00cca4855babab LEADER Placement: local.local.local
                                 Seqno: 1675087269122547 Start time: 1675087269122547
                                 RPC addresses: ( yb-2.local:7100 )
                                 HTTP addresses: ( yb-2.local:7000 )
fce214eaeb1841c7b625e23e9a8223cb FOLLOWER Placement: local.local.local
                                 Seqno: 1675087329989684 Start time: 1675087329989684
                                 RPC addresses: ( yb-3.local:7100 )
                                 HTTP addresses: ( yb-3.local:7000 )

Here, the node yb-2.local hosts the current master LEADER.

Now perform the following action:

➜ yb_stats --adhoc-nonmetrics-diff
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.

Then open another shell (this one is a shell on the cluster), and move the master leader to another node:

[vagrant@yb-3 ~]$ yb-admin -init_master_addrs=localhost:7100 master_leader_stepdown c5d0f3ff5e824bb7b637ee16b3a3aa63

Now head over to the yb_stats session, and press enter:

Time between snapshots:  200.602 seconds
= Masters:  6bd3f8363cbc4ce69b00cca4855babab Role: LEADER->FOLLOWER Placement: local.local.local
                                             Seq#: 1675087269122547 Start time: 1675087269122547
                                             RPC: yb-2.local:7100,
                                             HTTP: yb-2.local:7000,
= Masters:  c5d0f3ff5e824bb7b637ee16b3a3aa63 Role: FOLLOWER->LEADER Placement: local.local.local
                                             Seq#: 1675087212541781 Start time: 1675087212541781
                                             RPC: yb-1.local:7100,
                                             HTTP: yb-1.local:7000,

This shows that master UUID 6bd3f8363cbc4ce69b00cca4855babab went from LEADER to FOLLOWER, and master UUID c5d0f3ff5e824bb7b637ee16b3a3aa63 went from FOLLOWER to LEADER.

Nothing else changed.
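
If such a diff has to be checked by a script rather than by eye, the role changes can be picked out of the text. The following Python sketch is only an illustration and not part of yb_stats: the function names are made up for this example, and the regular expression assumes the line layout of the output shown above (shortened here to the two "= Masters:" lines).

```python
import re

# The two "= Masters:" lines copied from the diff output above.
DIFF_OUTPUT = """\
= Masters:  6bd3f8363cbc4ce69b00cca4855babab Role: LEADER->FOLLOWER Placement: local.local.local
= Masters:  c5d0f3ff5e824bb7b637ee16b3a3aa63 Role: FOLLOWER->LEADER Placement: local.local.local
"""

# Assumed layout: "= Masters:", a 32-character hex UUID, then "Role: OLD->NEW".
ROLE_CHANGE = re.compile(r"^= Masters:\s+([0-9a-f]{32}) Role: (\w+)->(\w+)")


def role_changes(text):
    """Return a list of (uuid, old_role, new_role) for every changed master line."""
    changes = []
    for line in text.splitlines():
        match = ROLE_CHANGE.match(line)
        if match:
            changes.append(match.groups())
    return changes


def new_leader(text):
    """Return the UUID of the master that became LEADER, or None if there is none."""
    for uuid, _old_role, new_role in role_changes(text):
        if new_role == "LEADER":
            return uuid
    return None


for uuid, old_role, new_role in role_changes(DIFF_OUTPUT):
    print(f"{uuid}: {old_role} -> {new_role}")
print("new leader:", new_leader(DIFF_OUTPUT))
```

For the output above, new_leader returns c5d0f3ff5e824bb7b637ee16b3a3aa63, the master that was passed to the stepdown command.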

Tablet server reboot

How about a situation where a tablet server had an issue?
Let's simulate that:

Start the non-metrics diff to capture the starting situation:

➜ yb_stats --adhoc-nonmetrics-diff
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.

Then open another shell on a node and, for the sake of the test, restart a tablet server:

[vagrant@yb-3 ~]$ sudo systemctl restart yb-tserver

Now go to the yb_stats session, and press enter to see the result:

Time between snapshots:   16.405 seconds
= Tablet:   ycql.system.transactions.0fa815a60bf8447798528c1e12aa9db9, state: RUNNING leader: yb-3.local:9100->yb-2.local:9100
= Tablet:   ycql.system.transactions.13801aa4464448f294b6280a610d5314, state: RUNNING leader: yb-3.local:9100->yb-2.local:9100
= Tablet:   ycql.system.transactions.143f9753b61a4b0ca9bc121fcb35b046, state: RUNNING leader: yb-2.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.1ae4ec3851d54bd182a8ba38462269ba, state: RUNNING leader: yb-2.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.307d79c0d896492882141942f93d54a1, state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
= Tablet:   ycql.system.transactions.4c50f31ccb484394937b7634377cd459, state: RUNNING leader: yb-1.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.572d5b78a740458389be56e9f803ee14, state: RUNNING leader: yb-1.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.619bc587c0f048a7bbe3994d3fe73cce, state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
= Tablet:   ycql.system.transactions.64edb721a0ed4d209c4fa5b860193e5f, state: RUNNING leader: yb-1.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.be079d77ff5d4a77bc0c41edb62632d1, state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
= Tserver:  yb-3.local:9000, status: ALIVE, uptime: 70->5
= 192.168.66.82:12000  Vars: heap_profile_path                                  /tmp/yb-tserver.10057->/tmp/yb-tserver.10205 Default
= 192.168.66.82:9000   Vars: heap_profile_path                                  /tmp/yb-tserver.10057->/tmp/yb-tserver.10205 Default

Quite interesting. Simply restarting a tablet server seems to change a lot:

  • A lot of ycql.system.transactions tablets changed leader. Despite the name saying 'ycql', these are the transaction tablets that both YCQL and YSQL use for registering transactions.
  • The tablet server I restarted can be identified by the change in its uptime.
  • The restart dynamically changes the gflag heap_profile_path.
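
To get a quick overview of how leadership moved, the "= Tablet:" lines can be tallied per tablet server. This is a minimal Python sketch under the assumption that the lines look exactly like the output above; the function name leader_moves is invented for this example and is not something yb_stats provides.

```python
import re
from collections import Counter

# The tablet and tserver lines copied from the diff output above.
DIFF_OUTPUT = """\
= Tablet:   ycql.system.transactions.0fa815a60bf8447798528c1e12aa9db9, state: RUNNING leader: yb-3.local:9100->yb-2.local:9100
= Tablet:   ycql.system.transactions.13801aa4464448f294b6280a610d5314, state: RUNNING leader: yb-3.local:9100->yb-2.local:9100
= Tablet:   ycql.system.transactions.143f9753b61a4b0ca9bc121fcb35b046, state: RUNNING leader: yb-2.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.1ae4ec3851d54bd182a8ba38462269ba, state: RUNNING leader: yb-2.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.307d79c0d896492882141942f93d54a1, state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
= Tablet:   ycql.system.transactions.4c50f31ccb484394937b7634377cd459, state: RUNNING leader: yb-1.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.572d5b78a740458389be56e9f803ee14, state: RUNNING leader: yb-1.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.619bc587c0f048a7bbe3994d3fe73cce, state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
= Tablet:   ycql.system.transactions.64edb721a0ed4d209c4fa5b860193e5f, state: RUNNING leader: yb-1.local:9100->yb-3.local:9100
= Tablet:   ycql.system.transactions.be079d77ff5d4a77bc0c41edb62632d1, state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
= Tserver:  yb-3.local:9000, status: ALIVE, uptime: 70->5
"""

# Assumed layout: "= Tablet:", the tablet name, its state, then "leader: OLD->NEW".
LEADER_MOVE = re.compile(r"^= Tablet:\s+\S+, state: \w+ leader: (\S+)->(\S+)$")


def leader_moves(text):
    """Count, per tablet server, how many tablet leaders it lost and gained."""
    lost, gained = Counter(), Counter()
    for line in text.splitlines():
        match = LEADER_MOVE.match(line)
        if match:
            lost[match.group(1)] += 1
            gained[match.group(2)] += 1
    return lost, gained


lost, gained = leader_moves(DIFF_OUTPUT)
for server in sorted(lost.keys() | gained.keys()):
    print(f"{server}: lost {lost[server]}, gained {gained[server]}")
```

Applied to the output above, the tally shows that yb-3.local:9100 gave up the leader role for five tablets and took it over for five others, so the number of leaders per tablet server is the same before and after.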

Database changes

Another thing that is very interesting for a database administrator is being able to see whether databases, or objects inside them, have been created.

Simply start the non-metrics diff to capture the starting situation:

➜ yb_stats --adhoc-nonmetrics-diff
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.

For the sake of the demonstration, create a table with an index, and create a colocated database:

yugabyte=# create table demo (id int primary key, f1 text);
CREATE TABLE
yugabyte=# create index demo_i on demo(f1);
CREATE INDEX
yugabyte=# create database demodb with colocated=true;
CREATE DATABASE

Then head over to the yb_stats shell, and press enter:

Time between snapshots:  119.374 seconds
+ Database: ysql.demodb, id: 00004106000030008000000000000000 [colocated]
+ Object:   ysql.yugabyte.demo, state: RUNNING, id: 000033e8000030008000000000004100
+ Object:   ysql.yugabyte.demo_i, state: RUNNING, id: 000033e8000030008000000000004105
+ Tablet:   ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, state: RUNNING, leader: yb-3.local:9100
+ Tablet:   ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, state: RUNNING, leader: yb-3.local:9100
+ Tablet:   ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, state: RUNNING, leader: yb-3.local:9100
+ Tablet:   ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, state: RUNNING, leader: yb-2.local:9100
+ Tablet:   ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, state: RUNNING, leader: yb-2.local:9100
+ Tablet:   ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, state: RUNNING, leader: yb-1.local:9100
+ Tablet:   ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, state: RUNNING, leader: yb-1.local:9100
+ Replica:  yb-1.local:9100:ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, Type: VOTER
+ Replica:  yb-1.local:9100:ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, Type: VOTER
+ Replica:  yb-1.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
+ Replica:  yb-1.local:9100:ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, Type: VOTER
+ Replica:  yb-1.local:9100:ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, Type: VOTER
+ Replica:  yb-1.local:9100:ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, Type: VOTER
+ Replica:  yb-1.local:9100:ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, Type: VOTER

Yep, lots of output, because there simply are a lot of parts on multiple layers in a YugabyteDB cluster. The important part is that you can see what has changed. In this case, that means the different objects that were added.
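
With this much output, a per-type count can help before reading the individual lines. The Python sketch below groups diff lines by their marker ("+", "-" or "=") and by the word before the first colon. It is an illustration only: the function name summarize is invented, the sample is a shortened excerpt of the output above, and lines that do not follow the "marker, type, colon" layout (such as the gflag lines that start with an IP address) are simply skipped.

```python
import re
from collections import Counter

# A shortened excerpt of the diff output above.
DIFF_OUTPUT = """\
+ Database: ysql.demodb, id: 00004106000030008000000000000000 [colocated]
+ Object:   ysql.yugabyte.demo, state: RUNNING, id: 000033e8000030008000000000004100
+ Object:   ysql.yugabyte.demo_i, state: RUNNING, id: 000033e8000030008000000000004105
+ Tablet:   ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, state: RUNNING, leader: yb-3.local:9100
+ Replica:  yb-1.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
+ Replica:  yb-2.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
+ Replica:  yb-3.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
"""

# Assumed layout: a marker (+, - or =), a space, a type name, a colon.
ENTRY = re.compile(r"^([+\-=]) (\w+):")


def summarize(text):
    """Count diff lines per (marker, type) pair, ignoring lines that do not match."""
    counts = Counter()
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


for (marker, kind), count in sorted(summarize(DIFF_OUTPUT).items()):
    print(f"{marker} {kind}: {count}")
```

Run against the complete output above instead of the excerpt, the same count gives 1 database, 2 objects, 7 tablets and 21 replicas, which is three replicas for every tablet.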

Non-metrics diff via snapshots

You probably get the idea, but for completeness' sake, let's perform a non-metrics diff exercise using snapshots instead of ad hoc:

Create a snapshot:

➜ yb_stats --snapshot
snapshot number 2

Then perform some action:

yugabyte=# drop database demodb;
DROP DATABASE
yugabyte=# drop table demo;
DROP TABLE

And snapshot again:

➜ yb_stats --snapshot
snapshot number 3

And perform a non-metrics diff based on the snapshots:

➜ yb_stats --snapshot-nonmetrics-diff -b 2 -e 3
- Database: ysql.demodb, id: 00004106000030008000000000000000 [colocated]
- Object:   ysql.yugabyte.demo, state: RUNNING, id: 000033e8000030008000000000004100
- Object:   ysql.yugabyte.demo_i, state: RUNNING, id: 000033e8000030008000000000004105
- Tablet:   ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, state: RUNNING, leader: yb-3.local:9100
- Tablet:   ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, state: RUNNING, leader: yb-3.local:9100
- Tablet:   ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, state: RUNNING, leader: yb-3.local:9100
- Tablet:   ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, state: RUNNING, leader: yb-2.local:9100
- Tablet:   ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, state: RUNNING, leader: yb-2.local:9100
- Tablet:   ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, state: RUNNING, leader: yb-1.local:9100
- Tablet:   ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, state: RUNNING, leader: yb-1.local:9100
- Replica:  yb-1.local:9100:ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, Type: VOTER
- Replica:  yb-2.local:9100:ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, Type: VOTER
- Replica:  yb-3.local:9100:ysql.demodb.00004106000030008000000000000000.colocated.parent.tablename.0a7557258b614e489e6979998781d4a3, Type: VOTER
- Replica:  yb-1.local:9100:ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, Type: VOTER
- Replica:  yb-2.local:9100:ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, Type: VOTER
- Replica:  yb-3.local:9100:ysql.yugabyte.demo_i.12d8c9d1a2444b05b90eed2869306488, Type: VOTER
- Replica:  yb-1.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
- Replica:  yb-2.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
- Replica:  yb-3.local:9100:ysql.yugabyte.demo.22f2922e3921458c8bf1624f76bddcf5, Type: VOTER
- Replica:  yb-1.local:9100:ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, Type: VOTER
- Replica:  yb-2.local:9100:ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, Type: VOTER
- Replica:  yb-3.local:9100:ysql.yugabyte.demo_i.24f8d503298e42e197fb1b88f788445d, Type: VOTER
- Replica:  yb-1.local:9100:ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, Type: VOTER
- Replica:  yb-2.local:9100:ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, Type: VOTER
- Replica:  yb-3.local:9100:ysql.yugabyte.demo.413cb64b95264d3fa5215dcc8932728c, Type: VOTER
- Replica:  yb-1.local:9100:ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, Type: VOTER
- Replica:  yb-2.local:9100:ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, Type: VOTER
- Replica:  yb-3.local:9100:ysql.yugabyte.demo_i.cb98c31390dd492b89a08fc8811deb4e, Type: VOTER
- Replica:  yb-1.local:9100:ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, Type: VOTER
- Replica:  yb-2.local:9100:ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, Type: VOTER
- Replica:  yb-3.local:9100:ysql.yugabyte.demo.cccbfc59712c478588262c56a9dcf175, Type: VOTER

Here we see that the (colocated) database that was created earlier has been removed ("-"), and that the table "demo" has been removed as well.

Removing the table causes the index on it to be removed too. All of that causes the tablets that these objects used to be removed, which in turn causes their replicas to be removed.
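
The three markers used throughout this article ("+" for added, "-" for removed, "=" for changed) follow from comparing two snapshots entry by entry. The Python sketch below shows that principle on plain dictionaries. It is not how yb_stats is implemented: the function diff_snapshots, the key format, and the entry ysql.yugabyte.new_table are all assumptions made up for this illustration.

```python
def diff_snapshots(begin, end):
    """Compare two {key: value} snapshots and return the diff as text lines.

    "+" marks a key that only exists in the end snapshot,
    "-" marks a key that only exists in the begin snapshot,
    "=" marks a key that exists in both but with a different value.
    Keys with identical values in both snapshots produce no output.
    """
    lines = []
    for key in sorted(begin.keys() | end.keys()):
        if key not in begin:
            lines.append(f"+ {key}: {end[key]}")
        elif key not in end:
            lines.append(f"- {key}: {begin[key]}")
        elif begin[key] != end[key]:
            lines.append(f"= {key}: {begin[key]}->{end[key]}")
    return lines


# Hypothetical snapshots, loosely modelled on the examples in this article.
begin = {
    "Database ysql.demodb": "colocated",
    "Master 6bd3f8363cbc4ce69b00cca4855babab": "LEADER",
    "Object ysql.yugabyte.demo": "RUNNING",
    "Tserver yb-1.local:9000": "ALIVE",
}
end = {
    "Master 6bd3f8363cbc4ce69b00cca4855babab": "FOLLOWER",
    "Object ysql.yugabyte.new_table": "RUNNING",  # made-up entry
    "Tserver yb-1.local:9000": "ALIVE",
}

for line in diff_snapshots(begin, end):
    print(line)
```

The unchanged tablet server entry is not printed, which is the reason a diff stays readable even when the snapshots themselves are large.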

Conclusion

Non-metric diffs make it possible to see what changed in a YugabyteDB database, which can be very helpful when investigating a YugabyteDB cluster/universe.
