Active Session History (ASH) in YugabyteDB

Franck Pachot - Jun 18 - - Dev Community

The latest versions of YugabyteDB have a new instrumentation feature to troubleshoot performance: Active Session History, which gathers information about the threads running in the Tablet Servers and samples it. You need to set the following:

--allowed_preview_flags_csv=ysql_yb_ash_enable_infra,ysql_yb_enable_ash
--ysql_yb_ash_enable_infra=true
--ysql_yb_enable_ash=true

Enter fullscreen mode Exit fullscreen mode

I'll probably refine my queries on it in the future, but for the moment, I've created a function that can gather the samples for the last minute from all tablet servers. I run select * from gv$ash(seconds=>60); and get the following:

psql (16.2, server 11.2-YB-2.21.1.0-b0)
Type "help" for help.

yugabyte=# select * from gv$ash(seconds=>60);
 samples | #req | #rpc | #ysql | component | event_type | event_class |             wait_event             |                         info                         |    host    | zone  | region | cloud | secs
---------+------+------+-------+-----------+------------+-------------+------------------------------------+------------------------------------------------------+------------+-------+--------+-------+------
      56 |   56 |    0 |     1 | YSQL      | Cpu        | YSQLQuery   | QueryProcessing                    | insert into xxx(value) select generate_series($1,$2) | 10.0.0.181 | zone1 | fra    | cloud |   60
      26 |    1 |   26 |     1 | TServer   | Network    | Client      | YBClient_WaitingOnDocDB            |                                                      | 10.0.0.181 | zone1 | fra    | cloud |   59
      17 |    1 |   17 |     1 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: 6b4edd419b554eb                           | 10.0.0.181 | zone1 | fra    | cloud |   43
      14 |    1 |   14 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: 9c885e587b73499                           | 10.0.0.182 | zone2 | fra    | cloud |   52
      11 |    1 |    9 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: e9e5e29943864ba                           | 10.0.0.183 | zone3 | fra    | cloud |   58
       9 |    1 |    9 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       |                                                      | 10.0.0.181 | zone1 | fra    | cloud |   54
       9 |    1 |    9 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: 9c885e587b73499                           | 10.0.0.181 | zone1 | fra    | cloud |   52
       9 |    1 |    9 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: aab25ce7209043f                           | 10.0.0.183 | zone3 | fra    | cloud |   59
       7 |    1 |    7 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: 6b4edd419b554eb                           | 10.0.0.182 | zone2 | fra    | cloud |   21
       7 |    1 |    7 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      |                                                      | 10.0.0.181 | zone1 | fra    | cloud |   53
       7 |    1 |    7 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       |                                                      | 10.0.0.183 | zone3 | fra    | cloud |   56
       6 |    2 |    4 |     2 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: e9e5e29943864ba                           | 10.0.0.182 | zone2 | fra    | cloud |    1
       6 |    1 |    6 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: ae8cdaebee6e4ad                           | 10.0.0.182 | zone2 | fra    | cloud |   51
       6 |    2 |    6 |     1 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: ae8cdaebee6e4ad                           | 10.0.0.183 | zone3 | fra    | cloud |    4
       5 |    5 |    5 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: ae8cdaebee6e4ad                           | 10.0.0.183 | zone3 | fra    | cloud |   37
       5 |    1 |    4 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: e9e5e29943864ba                           | 10.0.0.181 | zone1 | fra    | cloud |   24
       5 |    5 |    5 |     1 | TServer   | Network    | TabletWait  | ConflictResolution_ResolveConficts | tablet id: aab25ce7209043f                           | 10.0.0.182 | zone2 | fra    | cloud |   18
       5 |    5 |    5 |     2 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: aab25ce7209043f                           | 10.0.0.182 | zone2 | fra    | cloud |   49
       5 |    4 |    5 |     2 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: aab25ce7209043f                           | 10.0.0.182 | zone2 | fra    | cloud |    5
       5 |    1 |    5 |     1 | TServer   | Cpu        | Consensus   | Raft_ApplyingEdits                 | tablet id: ae8cdaebee6e4ad                           | 10.0.0.183 | zone3 | fra    | cloud |   26
       5 |    1 |    5 |     1 | TServer   | Cpu        | Consensus   | Raft_ApplyingEdits                 | tablet id: 9c885e587b73499                           | 10.0.0.183 | zone3 | fra    | cloud |   26
       5 |    1 |    5 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: ae8cdaebee6e4ad                           | 10.0.0.181 | zone1 | fra    | cloud |   45
       5 |    1 |    5 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: aab25ce7209043f                           | 10.0.0.181 | zone1 | fra    | cloud |   40
       5 |    1 |    5 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: 6b4edd419b554eb                           | 10.0.0.183 | zone3 | fra    | cloud |   15
       4 |    4 |    4 |     1 | TServer   | Network    | TabletWait  | ConflictResolution_ResolveConficts | tablet id: ae8cdaebee6e4ad                           | 10.0.0.183 | zone3 | fra    | cloud |   40
       4 |    1 |    4 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: df10b3d89f3349f                           | 10.0.0.183 | zone3 | fra    | cloud |   29
       3 |    1 |    3 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: 6b4edd419b554eb                           | 10.0.0.183 | zone3 | fra    | cloud |   38
       3 |    1 |    3 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: 8cefb385bcdc44b                           | 10.0.0.183 | zone3 | fra    | cloud |   44
       3 |    2 |    3 |     2 | TServer   | Cpu        | Consensus   | Raft_ApplyingEdits                 | tablet id: aab25ce7209043f                           | 10.0.0.182 | zone2 | fra    | cloud |    2
       3 |    3 |    3 |     1 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: 9c885e587b73499                           | 10.0.0.183 | zone3 | fra    | cloud |    7
       3 |    1 |    3 |     1 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: b63700bea0ef4b0                           | 10.0.0.181 | zone1 | fra    | cloud |    7
       3 |    3 |    3 |     1 | TServer   | Network    | TabletWait  | ConflictResolution_ResolveConficts | tablet id: 9c885e587b73499                           | 10.0.0.183 | zone3 | fra    | cloud |   13
       3 |    3 |    0 |     1 | YSQL      | Network    | TServerWait | StorageFlush                       | insert into xxx(value) select generate_series($1,$2) | 10.0.0.181 | zone1 | fra    | cloud |   23
       2 |    1 |    2 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: 6b4edd419b554eb                           | 10.0.0.181 | zone1 | fra    | cloud |   25
       2 |    1 |    2 |     1 | TServer   | Cpu        | Consensus   | Raft_ApplyingEdits                 | tablet id: e9e5e29943864ba                           | 10.0.0.182 | zone2 | fra    | cloud |   28
       2 |    1 |    2 |     1 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: df10b3d89f3349f                           | 10.0.0.181 | zone1 | fra    | cloud |    6
       2 |    1 |    2 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: df10b3d89f3349f                           | 10.0.0.182 | zone2 | fra    | cloud |    7
       2 |    2 |    2 |     1 | TServer   | DiskIO     | RocksDB     | RocksDB_NewIterator                | tablet id: aab25ce7209043f                           | 10.0.0.182 | zone2 | fra    | cloud |   51
       2 |    2 |    2 |     1 | TServer   | DiskIO     | RocksDB     | RocksDB_NewIterator                | tablet id: ae8cdaebee6e4ad                           | 10.0.0.183 | zone3 | fra    | cloud |    9
       2 |    1 |    1 |     1 | TServer   | DiskIO     | TabletWait  | SaveRaftGroupMetadataToDisk        |                                                      | 10.0.0.181 | zone1 | fra    | cloud |    1
       2 |    1 |    1 |     1 | TServer   | DiskIO     | TabletWait  | SaveRaftGroupMetadataToDisk        |                                                      | 10.0.0.182 | zone2 | fra    | cloud |    1
       2 |    1 |    2 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: ae8cdaebee6e4ad                           | 10.0.0.182 | zone2 | fra    | cloud |    5
       2 |    1 |    1 |     1 | TServer   | DiskIO     | TabletWait  | SaveRaftGroupMetadataToDisk        |                                                      | 10.0.0.183 | zone3 | fra    | cloud |    1
       2 |    1 |    2 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: 8cefb385bcdc44b                           | 10.0.0.181 | zone1 | fra    | cloud |    0
       2 |    1 |    2 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       |                                                      | 10.0.0.182 | zone2 | fra    | cloud |    1
       2 |    1 |    2 |     1 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: 8cefb385bcdc44b                           | 10.0.0.181 | zone1 | fra    | cloud |    6
       2 |    1 |    2 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: b63700bea0ef4b0                           | 10.0.0.182 | zone2 | fra    | cloud |    2
       2 |    1 |    2 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: eb7069a1649144e                           | 10.0.0.183 | zone3 | fra    | cloud |    4
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: 9c885e587b73499                           | 10.0.0.181 | zone1 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: 6885238820ff401                           | 10.0.0.182 | zone2 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: ae8cdaebee6e4ad                           | 10.0.0.183 | zone3 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: eb7069a1649144e                           | 10.0.0.182 | zone2 | fra    | cloud |    0
       1 |    1 |    0 |     1 | YSQL      | Extension  | YSQLQuery   | Extension                          |                                                      | 10.0.0.181 | zone1 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Network    | Consensus   | Raft_WaitingForReplication         | tablet id: eb7069a1649144e                           | 10.0.0.181 | zone1 | fra    | cloud |    0
       1 |    1 |    0 |     1 | YSQL      | Network    | TServerWait | CatalogRead                        | select * from gv$ash(seconds=>$1)                    | 10.0.0.181 | zone1 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Active                       | tablet id: b63700bea0ef4b0                           | 10.0.0.183 | zone3 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: ae8cdaebee6e4ad                           | 10.0.0.181 | zone1 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: 9c885e587b73499                           | 10.0.0.183 | zone3 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      | tablet id: e9e5e29943864ba                           | 10.0.0.182 | zone2 | fra    | cloud |    0
       1 |    1 |    1 |     1 | TServer   | Cpu        | Common      | OnCpu_Passive                      |                                                      | 10.0.0.183 | zone3 | fra    | cloud |    0
(60 rows)
Enter fullscreen mode Exit fullscreen mode

The YSQL layer is running an insert statement and the tablet identifiers can be used to find the table or index in http://yb-master:7000/dump-entities

ASH samples are visible from yb_active_session_history on each tablet server. I use the Foreign Data Wrapper to consolidate from all servers. Here is how I created the gv$ash() function and all dependent objects (FDW server, foreign tables, and views):


create extension if not exists postgres_fdw;
select format('
 create server if not exists "gv$%1$s"
 foreign data wrapper postgres_fdw
 options (host %2$L, port %3$L, dbname %4$L)
 ', host, host, port, current_database()) from yb_servers();
\gexec
select format('
 drop user mapping if exists for admin
 server "gv$%1$s"
 ',host) from yb_servers();
\gexec
select format('
 create user mapping if not exists for current_user
 server "gv$%1$s"
 --options ( user %2$L, password %3$L )
 ',host, 'yugabyte', 'SECRET')
 from yb_servers();
\gexec
select format('
 drop schema if exists "gv$%1$s" cascade
 ',host) from yb_servers();
\gexec
select format('
 create schema if not exists "gv$%1$s"
 ',host) from yb_servers();
\gexec
select format('
 import foreign schema "pg_catalog"
 limit to ("yb_active_session_history","pg_stat_statements")
 from server "gv$%1$s" into "gv$%1$s"
 ', host) from yb_servers();
\gexec

with views as (
select distinct foreign_table_name
from information_schema.foreign_tables t, yb_servers() s
where foreign_table_schema = format('gv$%1$s',s.host)
)
select format('drop view if exists "gv$%1$s"', foreign_table_name) from views
union all
select format('create or replace view public."gv$%2$s" as %1$s',
 string_agg(
 format('
 select %2$L as gv$host, %3$L as gv$zone, %4$L as gv$region, %5$L as gv$cloud,
 * from "gv$%2$s".%1$I
 ', foreign_table_name, host, zone, region, cloud)
 ,' union all '), foreign_table_name
) from views, yb_servers() group by views.foreign_table_name ;
\gexec

drop function if exists gv$ash;
create or replace function public.gv$ash(seconds interval default '60 seconds')
RETURNS TABLE (
    samples real,
    "#req" bigint,
    "#rpc" bigint,
    "#ysql" bigint,
    component text,
    event_type text,
    event_class text,
    wait_event text,
    info text,
    host text,
    zone text,
    region text,
    cloud text,
    secs int
)
as $$
select sum(sample_weight) as samples
 , count(distinct root_request_id) as "#req"
 , count(distinct rpc_request_id) as "#rpc"
 , count(distinct ysql_session_id) as "#ysql"
 , wait_event_component as component, wait_event_type as event_type
 , wait_event_class as event_class, wait_event
 , coalesce ( 'tablet_id: '||wait_event_aux, substr(query,1,60) ) as info
 , h.gv$host, h.gv$zone, h.gv$region, h.gv$cloud
 , extract(epoch from max(sample_time)-min(sample_time))::int as secs
 from gv$yb_active_session_history h
 left outer join gv$pg_stat_statements s
 on s.gv$host=h.gv$host and s.queryid=h.query_id
where
 sample_time>now()-seconds
group by
 wait_event_component, wait_event_type, wait_event_class, wait_event
 , wait_event_aux, substr(query,1,60)
 , h.gv$host, h.gv$zone, h.gv$region, h.gv$cloud
order by 1 desc
;
$$ language sql;
select * from gv$ash();

Enter fullscreen mode Exit fullscreen mode

The user and password are hardcoded here (yugabyte), but you can create your own user mapping to each server.

The list of wait event is documented:

Active Session History capability in YugabyteDB | YugabyteDB Docs

Use Active Session History to get current and past views of the database system activity.

favicon docs.yugabyte.com

My function shows the overall picture and you can query yb_active_session_history to get the details of all samples. There's a request ID that helps to match activity from different layers.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .