The R.A.G (Redshift Analyst Guide): Sorting & Zone Maps

Alex Antra - Nov 18 '19 - - Dev Community

Welcome to the R.A.G, a guide about Amazon's Redshift Database written for the Analyst's out there in the world who use it.


Previously on the R.A.G....


While Distribution determines what node the data gets stored on, Sorting determines how it gets stored on the Node. Sorting, along with something called Zone Mapping, is how you speed up Redshift at a node level.

Diagram of a Redshift Cluster (to remind you)
A Redshift Cluster

Sorting and Sort Keys

Just like how you can determine the Dist Key at the time of table creation, you can also determine a sort key. A sort key determines in what order a column, or multiple get stored on the slice.

Got a table of cars and it's sorted by manufacture date? Then a query which contains manufacture date in it's 'where' clause will finish faster than if the table was unsorted. This is because the Compute Node now has all the dates in order and knows where certain date values start and stop. This results is the compute node skipping data it knows it doesn't need to look out, for example you may only want cars made before 2000, and so Redshift knows that the values after 2000 start 35% up the slice, so it goes from 0-34%.

What's actually happening at the slice level, is the blocks themselves are being re-arranged in the order specified. To recap, a slice contains blocks that are 1 MB big. So a 1 TB slice contains 1,000,000 blocks, a table might consume 2000 blocks, each block fitting in as much data as it can. The blocks containing the column you have determined as your sort key are the blocks that get sorted in order, all the other blocks then fall in place behind those 'leader blocks'.

The difference with a Sort Key vs a Dist Key is that you can have more than one column as your Sort key, and there are even two distinct types; Compound and Interleaved.

Compound Sort Key

By default Redshift uses a compound sort key, to which you can have up to 400 of them. Compound sort keys sort the data in the order of columns you specify, with the first column being the priority and every other column after that getting less priority.

So if you specify three sort keys, it will sort all the rows by the first column specified, then the second and then the third and so on so forth. Effectively each sort key after the first is diluted in it's effectiveness.

Where this falls down is when your query references anything BUT the first sort column, even the second sort column is separated enough to cause the Compute Node to work harder than it could have.

See the below example where Car Color is the first Sort Key while Car's Make is the second.

Color Make
Yellow Porsche
Yellow Audi
Yellow BMW
Yellow Tesla
Red Porsche
Red Audi
Red BMW
Red Tesla
White Porsche
White Audi
White BMW
White Tesla
Blue Porsche
Blue Audi
Blue BMW
Blue Tesla
  • As you can see, any queries that directly relate to color, are fine as that is the first Sort Key. A query for JUST Yellow cars will result in the Compute Node stopping after four rows as it knows where the Yellow data stops.
  • However, If you search for Porsche, the Compute Node still knows where these records end, the 13th row, but it still had to cover the rows before it, making this less cost efficient than Car Color.

Basically, your first sort key should always be the most likely thing in your 'where' clause. If I rarely search my table of cars on color, then it shouldn't be my first sort key.

Multiple Compound Sort Keys are very common on Redshift Tables, despite the drop off in performance. The performance is still better than being unsorted. Though I wouldn't expect you to use all 400! (You shouldn't have a table with 400 columns in the first place)

Using a compound sort key

Compound is the default so you actually don't need to specify it. So both commands below work.

-- Default
create table    schema.tablename
diststyle       Even
sortkey         (<columnname>,<columnname>)
as
--
select
from
where
;
-- Specified 
create table     schema.tablename
diststyle        Even
compound sortkey (<columnname>,<columnname>)
as
--
select
from
where
;
Enter fullscreen mode Exit fullscreen mode

Interleaved Sort Key

An interleaved sort key applies equal weight to all columns defined. It allows for multiple sort keys but without losing out entirely on performance. Perfect for when a table may be queried one of two ways in equal measure, i.e there's no clear primary sort key.

What happens with an interleaved sort key is that, the columns you specify as the sort keys, get equal weight. So below I have my table from before, with a compound sort key.

Color Make
Yellow Porsche
Yellow Audi
Yellow BMW
Yellow Tesla
Red Porsche
Red Audi
Red BMW
Red Tesla
White Porsche
White Audi
White BMW
White Tesla
Blue Porsche
Blue Audi
Blue BMW
Blue Tesla

This is what it would look like with an interleaved sort key.

Color Make
Yellow Porsche
Yellow Audi
Red Porsche
Red Audi
White Porsche
White Audi
Blue Porsche
Blue Audi
Yellow BMW
Yellow Tesla
Red BMW
Red Tesla
White BMW
White Tesla
Blue BMW
Blue Tesla

The thing to note here is that every columns has the same weight, which results in a lot of shuffling, especially with big data sets. That means you'll never approach the speed that a single compound sort key can give you, however, you won't suffer the sharp performance cost that the additional columns do under a compound sort key.

It's a good compromise however Interleaved sort keys can only be used in specific scenarios, because:

  • They only work on large tables as it takes advantage of the data being spread across multiple blocks.
  • Interleaved tables take longer to load and vacuum, as new data needs to be reshuffled against existing
  • Interleaved should only be used on values that have low unique options. For example Car Color would be good because there is small finite amount of colors so that's easy to combine with other values, however Manufacture Year would be too granular, even a hundred years would be too much.

Using an interleaved sort key

You must specify an interleaved sort key

create table     schema.tablename
diststyle        Even
interleaved sortkey (<columnname>,<columnname>)
as
--
select
from
where
;
Enter fullscreen mode Exit fullscreen mode

Picking the right Sort Key, or multiple...

Judgey
In terms of compound or interleaved, 99% of the time it's going to be compound. Interleaved is only for specific tables that are going to be queried often, updated little and have equal weight, think a BI model, rarely will a staging table for analysis need an interleaved key.

In terms of what column to pick for compound, well it depends on how you're going to query the table.

If you have distributed your table by key, then it's always a good idea to make that column your first sort key, otherwise you would loose some of the performance benefit by distributing based on key.

If you distributed Even or All, or have a good reason not to use the key column then the first sort key should be the column that's always going to be in your where clause, and makes the most sense to sort the table by. Date and integer columns perform better than character columns when it comes to sorting so keep that in mind.

Zone Mapping

maps
Right so I've alluded to Zone Maps several times over the past articles and here.
Zone Maps are what make Redshift run fast. It allows Redshift to include or exclude data quickly, without actually looking at the data.

Behind every single block, is a meta-data layer (table) that knows the minimum and maximum values in that block. So if I'm searching for cars manufactured after 2000 and I have a block that says it's minimum year is 1977 and it's maximum year is 1989, Redshift will skip it. This enables Redshift to zoom past hundreds of thousands of blocks in milliseconds to get to the right data.

Zone maps don't just store values for the sortkey, they store all min max values of all columns in a block. However unsorted values will be stored at random so will be very cost efficient. This data is also used by the Redshift Explain Plan (covered in a later article) to guess at how much work a table will cost it.

Redshift can tell you how good your Sort Key is

If you run the below query, after you have built your table, and look for the column 'sortkey_skew_ratio', the closer to one the better. If you have a ration of 5 that means that your sort key is bringing back 5 blocks of data more than it should. This often indicates that your sort key spans too many rows and it too broad. However, if your sort key is the thing you always query on, you may not be able to change everything. There will sometimes be scenarios where the data can't be fine tuned due to it's nature.

select          i.schema        as schema_location, 
                i.table         as table_name,
                i.encoded       as are_columns_encoded,
                i.diststyle     as distyle_and_key,
                i.sortkey1      as first_sortkey,
                i.sortkey1_enc  as sortkey_compression,
                i.sortkey_num   as no_sort_keys,
                i.skew_sortkey1 as sortkey_skew_ratio,
                i.size          as size_in_blocks_mb,
                i.tbl_rows      as total_rows,
                i.skew_rows     as row_skew_ratio,
                i.pct_used      as percent_space_used,
                i.unsorted      as percent_unsorted,
                i.stats_off     as stats_needed
from            svv_table_info i
where           i.schema = '<schemaname>'
and             i.table = '<tablename>'
limit           50;
Enter fullscreen mode Exit fullscreen mode

header image drawn by me


Who am I?

You should read....

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .