The R.A.G (Redshift Analyst Guide): Troubleshooting process

Alex Antra - Nov 20 '19 - - Dev Community

Welcome to the R.A.G, a guide about Amazon's Redshift Database written for the Analyst's out there in the world who use it.


Previously on the R.A.G....


Stress
Are results slow?

Is WLM killing your query?

Do tables seem to not play ball?

There is a LOT to consider when trying to solve the above.

Here's the process for you to work through:

✅ Check the Explain Plan.

Plan
Do this for a single query at a time, not your whole script. But start by getting Redshift to tell you how it's going to execute your query. Make sure to look for actions with high costs, sequential scans or nested loops. If you can avoid them, or break your query into smaller tasks this will help you a lot.

✅ Understand the Distribution and Sorting of the tables you are dealing with

Understand
Whether the tables you are dealing with are built by you or someone else, their configuration could be working against you.

Run the below query and make note of:

  • distyle_and_key: How is the table distributed across nodes?
  • row_skew_ratio: This is the effectiveness of the dist key, the closer to 1 the better.
  • first_sortkey: How is it sorted on the node?
  • no_sort_keys: How many sort keys?
  • sortkey_skew_ratio: This is the effectiveness of the sort key, the closer to 1 the better.
  • percent_unsorted: How long has this table been since vacuum
  • stats_needed: If yes, the table needs 'analyst statistics' before the leader node knows how to handle it properly.
select  i.schema        as schema_location, 
        i.table         as table_name,
        i.encoded       as are_columns_encoded, 
        i.diststyle     as distyle_and_key,
        i.sortkey1      as first_sortkey,
        i.sortkey1_enc  as sortkey_compression,
        i.sortkey_num   as no_sort_keys,
        i.skew_sortkey1 as sortkey_skew_ratio,
        i.size          as size_in_blocks_mb,
        i.tbl_rows      as total_rows,
        i.skew_rows     as row_skew_ratio,
        i.pct_used      as percent_space_used,
        i.unsorted      as percent_unsorted,
        i.stats_off     as stats_needed
from    svv_table_info i
where   i.table = 'table_name'
limit   50
Enter fullscreen mode Exit fullscreen mode

✅ Run your query - read the error logs

Error
If you can run your query and its not being killed by the WLM or crashing, then check the Redshift error logs on how to make it run faster.

The table contains an EVENT and a SOLUTION table, this may provide some key information on how to make your query run faster.

select      l.event,
            l.solution,
            q.querytxt

from        stl_alert_event_log l

join        stl_query q
on          q.query = l.query

where       l.userid in (select usesysid from pg_user where usename ilike '%name%') --change to your name

order by    l.event_time desc

limit       10
Enter fullscreen mode Exit fullscreen mode

✅ Are you fighting for resources?

Fighting
In some scenarios your query may be slow because of lack of resources, or you had to wait until a slot opened up. Run the below query to see how your queries are being handled.

select      c.name as process_queue,
            w.slot_count,
            datediff(seconds,w.queue_start_time,w.queue_end_time) as q_wait_time_seconds,
            w.exec_start_time,
            w.exec_end_time,
            datediff(seconds,w.exec_start_time,w.exec_end_time) as exec_time_seconds,
            w.final_state,
            w.est_peak_mem,
            q.querytxt
from        stl_wlm_query w
join        stv_wlm_service_class_config c
on          c.service_class = w.service_class
join        stl_query q
on          q.query = w.query
where       w.userid in (select usesysid from pg_user where usename ilike '%user%') --change to your name
order  by   w.xid desc;

Enter fullscreen mode Exit fullscreen mode

✅ Understand best practice

Best practice


header image drawn by me


Who am I?

You should read....

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .