Troubleshooting a Slow Cluster with a Hot Node

Fabio Ghirardello - Jan 8 '21 - - Dev Community

This blog is about the process of troubleshooting a problematic cluster.

Understanding the Problem

Your cluster is experiencing high latency and spike in CPU usage for some nodes during the load test.
You need to lower latencies and improve CPU utilization to achieve higher throughput.

Take a look at the cluster to gather:

  • the environment configuration: CPUs, MEM, Storage, Networking, node count, node location, CockroachDB version, etc.
  • the database schemas.
  • the SQL queries run as part of the load test.

The Environment

The UAT environment runs on 12 nodes across 4 AZs in 2 regions, US East and US West.

The cluster is on the latest CockroachDB version, on 4 vCPUs/16GB Mem instances with standard storage.

localities

Verify that the latency is minimal (less than 1ms) within zones of the same region.

network-latency

Database Schemas

Here's the schema

CREATE TABLE credits (
    id INT2 NOT NULL,
    code UUID NOT NULL,
    channel STRING(1) NOT NULL,
    pid INT4 NOT NULL,
    end_date DATE NOT NULL,
    status STRING(1) NOT NULL,
    start_date DATE NOT NULL,
    CONSTRAINT "primary" PRIMARY KEY (id ASC, code ASC),
    INDEX credits_pid_idx (pid ASC),
    INDEX credits_code_id_idx (code ASC, id ASC) STORING (channel, status, end_date, start_date),
    FAMILY "primary" (id, code, channel, pid, end_date, status, start_date)
);

CREATE TABLE offers (
    id INT4 NOT NULL,
    code UUID NOT NULL,
    token UUID NOT NULL,
    start_date DATE,
    end_date DATE,
    CONSTRAINT "primary" PRIMARY KEY (id ASC, code ASC, token ASC),
    INDEX offers_token_idx (token ASC),
    FAMILY "primary" (id, code, token)
);
Enter fullscreen mode Exit fullscreen mode

Check how the data is distributed into ranges

SHOW RANGES FROM TABLE credits;
Enter fullscreen mode Exit fullscreen mode
                        start_key                       |                        end_key                        | range_id | range_size_mb | lease_holder |           lease_holder_locality           | replicas |                                                          replica_localities
--------------------------------------------------------+-------------------------------------------------------+----------+---------------+--------------+-------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------
  NULL                                                  | /1/"\x00\x05\x16\x80\xf7\xcbL䣵w\x81\x1a\x1d\xd6\xf6" |       38 |      0.000728 |            3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,8}  | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-b"}
  /1/"\x00\x05\x16\x80\xf7\xcbL䣵w\x81\x1a\x1d\xd6\xf6" | /15/"\x15\xd3\xe7\xb3_\xa9A\"\xb5p\xa2\xf5\xb9Ba'"    |       40 |    225.468243 |            3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,10} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-c"}
  /15/"\x15\xd3\xe7\xb3_\xa9A\"\xb5p\xa2\xf5\xb9Ba'"    | NULL                                                  |       41 |    222.721298 |            8 | cloud=gce,region=us-west1,zone=us-west1-b | {3,4,8}  | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-b"}
Enter fullscreen mode Exit fullscreen mode

Note that credits has 2 secondary indexes. Notice how the leaseholder of the ranges are spread across both regions (check the lease_holder_locality column).

SHOW RANGES FROM TABLE offers;
Enter fullscreen mode Exit fullscreen mode
  start_key | end_key | range_id | range_size_mb | lease_holder |           lease_holder_locality           | replicas |                                                          replica_localities
------------+---------+----------+---------------+--------------+-------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------
  NULL      | NULL    |       37 |             0 |            3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,12} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-c"}
(1 row)
Enter fullscreen mode Exit fullscreen mode

Notice how table offers has 1 secondary index, and the table is empty (range_size_mb is 0).

SQL Queries

Here is the SQL query

SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND c.pid = '000000'

UNION

SELECT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c, offers AS o
WHERE c.id = o.id
  AND c.code = o.code
  AND c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND o.token = 'c744250a-1377-4cdf-a1f4-5b85a4d29aaa';
Enter fullscreen mode Exit fullscreen mode

Run the load test

You start the app load test.
While it runs, check the Metrics in the DB Console. Open the Hardware dashboard to see if you can replicate the spike in high CPU usage.

cpu

Notice how 2 nodes have very high CPU usage compared to all other nodes. Take notice in the Summary of the values for QPS - 4046 - and P99 latency - 402ms -, too.

Check the Service Latency charts in the SQL dashboard for a better understanding.

sql-p99

Stop the workload now, indeed there are high CPU spikes and high latency.

Analyze the Queries

Let's break the query down into 2 parts, and let's pull the query plan for the 1st part. Again, the value 000000 is a placeholder for a value passed by the application.

EXPLAIN (VERBOSE) SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND c.pid = '000000';
Enter fullscreen mode Exit fullscreen mode
          tree         |        field        |                                   description                                    |                        columns                         | ordering
-----------------------+---------------------+----------------------------------------------------------------------------------+--------------------------------------------------------+-----------
                       | distribution        | local                                                                            |                                                        |
                       | vectorized          | false                                                                            |                                                        |
  project              |                     |                                                                                  | (id, code, channel, status, end_date, start_date)      |
   │                   | estimated row count | 0                                                                                |                                                        |
   └── filter          |                     |                                                                                  | (id, code, channel, pid, end_date, status, start_date) |
        │              | estimated row count | 0                                                                                |                                                        |
        │              | filter              | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20') |                                                        |
        └── index join |                     |                                                                                  | (id, code, channel, pid, end_date, status, start_date) |
             │         | estimated row count | 0                                                                                |                                                        |
             │         | table               | credits@primary                                                                  |                                                        |
             │         | key columns         | id, code                                                                         |                                                        |
             └── scan  |                     |                                                                                  | (id, code, pid)                                        |
                       | estimated row count | 0                                                                                |                                                        |
                       | table               | credits@credits_pid_idx                                                          |                                                        |
                       | spans               | /0-/1                                                                            |                                                        |
(15 rows)

Time: 77ms total (execution 77ms / network 0ms)
Enter fullscreen mode Exit fullscreen mode

So the optimizer is leveraging index credits@credits_pid_idx to filter rows that have that specific pid, but then it has to do a join with primary to fetch status, end_date and start_date to finish the rest of the WHERE, and SELECT, clauses.

Wouldn’t it be better if it didn’t have to do this join and instead access a single index?

Let's now pull the plan for the second part

EXPLAIN (VERBOSE) SELECT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c, offers AS o
WHERE c.id = o.id
  AND c.code = o.code
  AND c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND o.token = 'c744250a-1377-4cdf-a1f4-5b85a4d29aaa';
Enter fullscreen mode Exit fullscreen mode
            tree           |         field         |                                     description                                     |                              columns                               | ordering
---------------------------+-----------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------+-----------
                           | distribution          | full                                                                                |                                                                    |
                           | vectorized            | false                                                                               |                                                                    |
  project                  |                       |                                                                                     | (id, code, channel, status, end_date, start_date)                  |
   │                       | estimated row count   | 0                                                                                   |                                                                    |
   └── lookup join (inner) |                       |                                                                                     | (id, code, token, id, code, channel, end_date, status, start_date) |
        │                  | estimated row count   | 0                                                                                   |                                                                    |
        │                  | table                 | credits@credits_code_id_idx                                                         |                                                                    |
        │                  | equality              | (code, id) = (code,id)                                                              |                                                                    |
        │                  | equality cols are key |                                                                                     |                                                                    |
        │                  | pred                  | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20')    |                                                                    |
        └── scan           |                       |                                                                                     | (id, code, token)                                                  |
                           | estimated row count   | 1                                                                                   |                                                                    |
                           | table                 | offers@offers_token_idx                                                             |                                                                    |
                           | spans                 | /"\xc7D%\n\x13wLߡ\xf4[\x85\xa4Қ\xaa"-/"\xc7D%\n\x13wLߡ\xf4[\x85\xa4Қ\xaa"/PrefixEnd |                                                                    |
Enter fullscreen mode Exit fullscreen mode

Here we see that the optimizer is choosing an index to filter from the offers table and join with credits, which is fine.

Addressing the Hotspot

Let's tackle the high CPU usage issue first. Why is it so, why is a node, n3 in this case, using all the CPU?

You restart you workload and execute hot.py to find the hottest range.

$ python3 hot.py --numtop 10 --host myhost.mydomain.com --adminport 26258 --dbport 26257  
rank  rangeId          QPS           Nodes       leaseHolder    DBname, TableName, IndexName
  1:       37   2006.722472     [4, 3, 12]                 3    ['defaultdb', 'offers', '']
  2:       39   857.900812       [5, 2, 8]                 5    ['defaultdb', 'credits', 'credits_pid_idx']
  3:        6    48.688882      [1, 6, 3, 11, 9]                   9    ['', '', '']
  4:       26    17.644921      [1, 4, 11, 12, 7]                  4    ['system', 'namespace2', '']
  5:        4    15.409775      [1, 9, 12]                12    ['', '', '']
  6:       35    12.764951      [4, 3, 9, 8, 10]                  10    ['system', 'sqlliveness', '']
  7:       11     3.369730      [6, 2, 9, 12, 10]                  2    ['system', 'jobs', '']
  8:        2     2.697347      [1, 5, 3, 12, 7]                  12    ['', '', '']
  9:        3     1.976292      [6, 4, 3, 11, 7]                   6    ['', '', '']
 10:        7     1.373621      [1, 2, 4, 12, 7]                   4    ['system', 'lease', '']
Enter fullscreen mode Exit fullscreen mode

So it looks like Range ID 37 on n3 is hot. What's in that range, why that range?

Back to your SQL terminal, show the ranges for offers@offers_token_idx, since the query plan showed it's using this index

SHOW RANGES FROM INDEX offers@offers_token_idx;
Enter fullscreen mode Exit fullscreen mode
  start_key | end_key | range_id | range_size_mb | lease_holder |           lease_holder_locality           | replicas |                                                          replica_localities
------------+---------+----------+---------------+--------------+-------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------
  NULL      | NULL    |       37 |             0 |            3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,12} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-c"}
(1 row)

Time: 2.517s total (execution 2.516s / network 0.000s)
Enter fullscreen mode Exit fullscreen mode

Bingo! We found Range ID 37.

As offers is empty, so is index offers@offers_token_idx, and thus there is just one range for that table. If you have a join operation going on, inevitably CockroachDB will always want to access that range to do the join, causing the hotspot.

We need to ask our DBA:

  • Why is a join operation sent against an empty table?
  • Why is the table empty?

There are 4 possible ways to resolve a hot range situation:

  1. Add data to the table, so that more ranges are created.

  2. Use the Follower Reads pattern, where you relax the requirement to always go to the leaseholder to access the data and instead can read from any of the replicas. Range replicas are spread across the cluster, balancing the load to other nodes.

  3. Use the Duplicate Indexes pattern, where you duplicate the table/index and thus its ranges, resulting in multiple ranges that can serve the data.

  4. Use the CONFIGURE ZONE command to decrease the range size from 512MB to something smaller, 64MB for example. This will create many, small, scattered ranges that help balance the load.

  5. You can insert some junk data in the table or keep the old, obsolete data, so to keep enough data that the table results in multiple ranges.

In this case, you add a lot of data to the table. As a result, 4 ranges have been created and these will be spread across the nodes.

Re-run the workload, then check the Hardware dashboard again

cpu-even

Much better, good job! Let's see how the ranges for the index are spread out:

SELECT lease_holder, lease_holder_locality 
FROM [
  SHOW RANGES 
  FROM INDEX offers@offers_token_idx
];
Enter fullscreen mode Exit fullscreen mode
  lease_holder |           lease_holder_locality
---------------+--------------------------------------------
             5 | cloud=gce,region=us-east1,zone=us-east1-c
             4 | cloud=gce,region=us-east1,zone=us-east1-c
             3 | cloud=gce,region=us-east1,zone=us-east1-b
             2 | cloud=gce,region=us-east1,zone=us-east1-b
Enter fullscreen mode Exit fullscreen mode

Better! On average we can expect the load to be spread across 4 ranges in 4 different nodes.

Addressing the Latency

Understanding where the latency comes from

On the SQL terminal, you run a few queries to verify the Response Time.

Show my locality first

SHOW LOCALITY;
Enter fullscreen mode Exit fullscreen mode
                  locality
---------------------------------------------
  cloud=gce,region=us-east1,zone=us-east1-b
(1 row)

Time: 2ms total (execution 1ms / network 0ms)
Enter fullscreen mode Exit fullscreen mode

Ok, I'm in US East. Let's run the first part of the query using a randomly picked valid c.pid.

SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND c.pid = '1109619';
Enter fullscreen mode Exit fullscreen mode
  id |                 code                 | channel | status |         end_date          |        start_date
-----+--------------------------------------+---------+--------+---------------------------+----------------------------
   9 | f99e6553-18fb-475b-910e-eae4287e7ffa | O       | A      | 2020-12-19 00:00:00+00:00 | 2020-05-04 00:00:00+00:00
(1 row)

Time: 67ms total (execution 67ms / network 0ms)
Enter fullscreen mode Exit fullscreen mode

Response Time is 69ms, a little too much. Why is it so? Let's check where the range that has this row is located.

From the query plan we pulled above, we see that it's using index credits_pid_idx. Find the key of the index

SHOW INDEX FROM credits;
Enter fullscreen mode Exit fullscreen mode
  table_name |     index_name      | non_unique | seq_in_index | column_name | direction | storing | implicit
-------------+---------------------+------------+--------------+-------------+-----------+---------+-----------
[...]

  credits    | credits_pid_idx     |    true    |            1 | pid         | ASC       |  false  |  false
  credits    | credits_pid_idx     |    true    |            2 | id          | ASC       |  false  |   true
  credits    | credits_pid_idx     |    true    |            3 | code        | ASC       |  false  |   true

[...]
Enter fullscreen mode Exit fullscreen mode

Cool, for credits@credits_pid_idx the key is pid id code.
Let's pull the correct range

SELECT lease_holder_locality
FROM [
  SHOW RANGE FROM INDEX credits@credits_pid_idx 
  FOR ROW(1109619, 9, 'f99e6553-18fb-475b-910e-eae4287e7ffa')
];
Enter fullscreen mode Exit fullscreen mode
            lease_holder_locality
---------------------------------------------
  cloud=gce,region=us-east1,zone=us-east1-b
Enter fullscreen mode Exit fullscreen mode

Ok, the range is local (us-east-1), this should only take 1ms. From the query plan we see that there is a join with credits@primary to fetch the other columns.

Let's see how long that takes

SELECT *
FROM credits@primary
WHERE id = 9 AND code = 'f99e6553-18fb-475b-910e-eae4287e7ffa';
Enter fullscreen mode Exit fullscreen mode
  id |                 code                 | channel |   pid   |         end_date          | status |        start_date
-----+--------------------------------------+---------+---------+---------------------------+--------+----------------------------
   9 | f99e6553-18fb-475b-910e-eae4287e7ffa | O       | 1109619 | 2020-12-19 00:00:00+00:00 | A      | 2020-05-04 00:00:00+00:00
(1 row)

Time: 66ms total (execution 66ms / network 0ms)
Enter fullscreen mode Exit fullscreen mode

66ms! Let's do the same exercise as before and find out where this range is located.

SELECT lease_holder_locality
FROM [
  SHOW RANGE 
  FROM TABLE credits 
  FOR ROW(9, 'f99e6553-18fb-475b-910e-eae4287e7ffa')
];
Enter fullscreen mode Exit fullscreen mode
           lease_holder_locality           
-------------------------------------------
 cloud=gce,region=us-west1,zone=us-west1-c 
Enter fullscreen mode Exit fullscreen mode

A-ha! This table is in US West, so we're paying the latency price to go to the other region to fetch the data.

The problem is twofold: sub-optimal tables/indexes and cross-regional reads.

Part 1 - Optimize the table primary and secondary indexes

Let's optimize the top part of the query by removing the need to join with credits@primary.

We need to create an index similar to credits@credits_pid_idx that stores the fields required by the query.

Also, index credits@credits_code_id_idx seems to be redundant, so we drop it.

DROP INDEX credits@credits_code_id_idx;
DROP INDEX credits@credits_pid_idx;
CREATE INDEX credits_pid_idx
ON credits(pid ASC)
STORING (channel, end_date, status, start_date);
Enter fullscreen mode Exit fullscreen mode

Pull the query plan to confirm no join is required

EXPLAIN (VERBOSE) SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND c.pid = '12';
Enter fullscreen mode Exit fullscreen mode
       tree      |        field        |                                   description                                    |                        columns                         | ordering
-----------------+---------------------+----------------------------------------------------------------------------------+--------------------------------------------------------+-----------
                 | distribution        | local                                                                            |                                                        |
                 | vectorized          | false                                                                            |                                                        |
  project        |                     |                                                                                  | (id, code, channel, status, end_date, start_date)      |
   │             | estimated row count | 0                                                                                |                                                        |
   └── filter    |                     |                                                                                  | (id, code, channel, pid, end_date, status, start_date) |
        │        | estimated row count | 0                                                                                |                                                        |
        │        | filter              | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20') |                                                        |
        └── scan |                     |                                                                                  | (id, code, channel, pid, end_date, status, start_date) |
                 | estimated row count | 0                                                                                |                                                        |
                 | table               | credits@credits_pid_idx                                                          |                                                        |
                 | spans               | /12-/13                                                                          |                                                        |
Enter fullscreen mode Exit fullscreen mode

Good stuff, we eliminated a join operation!

As per the second part, we see that the optimizer is never using index offers@primary. Let's alter the primary key of that table to make it similar to index offers@offers_token_idx.

DROP INDEX offers_token_idx;

-- this will take some time as we're basically rewriting the entire table
BEGIN;
  ALTER TABLE offers DROP CONSTRAINT "primary";
  ALTER TABLE offers ADD CONSTRAINT "primary" PRIMARY KEY (token, id, code);
COMMIT;
Enter fullscreen mode Exit fullscreen mode

Check the docs for how to change the primary key while still keeping the database online.

Review the schema after these changes.

SHOW CREATE TABLE offers;
Enter fullscreen mode Exit fullscreen mode
  table_name |                          create_statement
-------------+----------------------------------------------------------------------
  offers     | CREATE TABLE public.offers (
             |     id INT4 NOT NULL,
             |     code UUID NOT NULL,
             |     token UUID NOT NULL,
             |     start_date DATE NULL,
             |     end_date DATE NULL,
             |     CONSTRAINT "primary" PRIMARY KEY (token ASC, id ASC, code ASC),
             |     FAMILY "primary" (id, code, token, start_date, end_date)
             | )
Enter fullscreen mode Exit fullscreen mode

Awesome, much more efficient!

Part 2 - Create duplicate indexes and pin to region

Now that we have our tables well organized, we need to resolve the latency issue.

We need our read latency to be the same regardless of where the query originates.

Implementing the Follower Reads pattern requires an App modification, so while ideal the Follower Reads pattern is unfortunately not available.

The best solution thus is to follow the Duplicate Index pattern: we create a copy of each index and table.

Then, we pin a copy to US West and another to US East.

We create the indexes first

-- copy of credits@primary
CREATE INDEX primary_copy
ON credits(id ASC, code ASC)
STORING (channel, pid, end_date, status, start_date);

-- copy of credits_pid_idx
CREATE INDEX credits_pid_idx_copy
ON credits(pid ASC)
STORING (channel, end_date, status, start_date);

-- copy of offers@primary
CREATE INDEX primary_copy
ON offers(token ASC, id ASC, code ASC)
STORING (start_date, end_date);
Enter fullscreen mode Exit fullscreen mode

Excellent, we have now a copy of each index (primary included)!

Next, pin a copy to East, and another to West.

-- credits
--   pin to East
ALTER TABLE credits CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{+region=us-east1: 1}',
  lease_preferences = '[[+region=us-east1]]';

ALTER INDEX credits@credits_pid_idx CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{+region=us-east1: 1}',
  lease_preferences = '[[+region=us-east1]]';

--   pin to West
ALTER INDEX credits@primary_copy CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{+region=us-west1: 1}',
  lease_preferences = '[[+region=us-west1]]';

ALTER INDEX credits@credits_pid_idx_copy CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{+region=us-west1: 1}',
  lease_preferences = '[[+region=us-west1]]';

-- offers
--    pin to East
ALTER TABLE offers CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{+region=us-east1: 1}',
  lease_preferences = '[[+region=us-east1]]';

--    pin to West
ALTER INDEX offers@primary_copy CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{+region=us-west1: 1}',
  lease_preferences = '[[+region=us-west1]]';
Enter fullscreen mode Exit fullscreen mode

Part 3 - Validate the theory

Re-run the first part of the query from both regions. Check the query plan using EXPLAIN (VERBOSE).

From n1 (US East region):

SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND c.pid = '12';
Enter fullscreen mode Exit fullscreen mode
  id |                 code                 | channel | status |         end_date          |        start_date
-----+--------------------------------------+---------+--------+---------------------------+----------------------------
  19 | 468750f4-cb58-4707-9fd3-bd5f99111855 | O       | A      | 2020-12-18 00:00:00+00:00 | 2020-09-21 00:00:00+00:00
(1 row)

Time: 1ms total (execution 1ms / network 0ms)
Enter fullscreen mode Exit fullscreen mode
EXPLAIN (VERBOSE) SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
  AND c.end_date >= '2020-11-20'
  AND c.start_date <= '2020-11-20'
  AND c.pid = '12';
Enter fullscreen mode Exit fullscreen mode
       tree      |        field        |                                      description                                      |                        columns                         | ordering
-----------------+---------------------+---------------------------------------------------------------------------------------+--------------------------------------------------------+-----------
                 | distribution        | local                                                                                 |                                                        |
                 | vectorized          | false                                                                                 |                                                        |
  project        |                     |                                                                                       | (id, code, channel, status, end_date, start_date)      |
   │             | estimated row count | 0                                                                                     |                                                        |
   └── filter    |                     |                                                                                       | (id, code, channel, pid, end_date, status, start_date) |
        │        | estimated row count | 0                                                                                     |                                                        |
        │        | filter              | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20') |                                                        |
        └── scan |                     |                                                                                       | (id, code, channel, pid, end_date, status, start_date) |
                 | estimated row count | 0                                                                                     |                                                        |
                 | table               | credits@credits_pid_idx                                                               |                                                        |
                 | spans               | /3124791208-/3124791209                                                               |                                                        |                                                        |                                                        |
Enter fullscreen mode Exit fullscreen mode

Same queries above run on n12 (US West region):

  id |                 code                 | channel | status |         end_date          |        start_date
-----+--------------------------------------+---------+--------+---------------------------+----------------------------
  19 | 468750f4-cb58-4707-9fd3-bd5f99111855 | O       | A      | 2020-12-18 00:00:00+00:00 | 2020-09-21 00:00:00+00:00
(1 row)

Time: 2ms total (execution 1ms / network 0ms)
Enter fullscreen mode Exit fullscreen mode
       tree      |        field        |                                      description                                      |                        columns                         | ordering
-----------------+---------------------+---------------------------------------------------------------------------------------+--------------------------------------------------------+-----------
                 | distribution        | local                                                                                 |                                                        |
                 | vectorized          | false                                                                                 |                                                        |
  project        |                     |                                                                                       | (id, code, channel, status, end_date, start_date)      |
   │             | estimated row count | 1                                                                                     |                                                        |
   └── filter    |                     |                                                                                       | (id, code, channel, pid, end_date, status, start_date) |
        │        | estimated row count | 1                                                                                     |                                                        |
        │        | filter              | ((status = 'A') AND (end_date >= '2020-11-19')) AND (start_date <= '2020-11-19') |                                                        |
        └── scan |                     |                                                                                       | (id, code, channel, pid, end_date, status, start_date) |
                 | estimated row count | 2                                                                                     |                                                        |
                 | table               | credits@credits_pid_idx_copy                                                          |                                                        |
                 | spans               | /3124791208-/3124791209                                                               |                                                        |
Enter fullscreen mode Exit fullscreen mode

Perfect, we've low latency from both regions! Now start the workload again and let's measure the overall latency.

Compare to the initial result: A huge improvement in performance! We doubled the QPS and halved the Latency!

final

Congratulations, your cluster is performing much better now!
We hope you have now a better understanding on the process of troubleshooting an underperforming cluster.

References

Official CockroachDB Documentation

Blogs

. . . . . . . . . . . . . . . . . . . . .