
How does sortkey in Redshift work internally?


I'm a beginner to Redshift and Data Warehouses in general.

When a numeric or timestamp column is specified as the sortkey, does the Redshift DBMS use binary search during a query to find the desired row as efficiently as possible?

I feel that knowing more about this would improve my table design skills.


Solution

  • Amazon Redshift is a columnar datastore, which means that each column is stored separately. This is great for wide tables because Redshift only needs to read the columns that are actually used in the query. The most time-consuming part of a database query is usually disk access, so anything that reduces or avoids disk access is a good thing.

    When data is stored on disk, it is stored in 1 MB disk blocks. Each column can consume multiple blocks, and each block only contains data relating to one column. Redshift keeps a Zone Map that records the minimum and maximum values stored in each block. For example, if a query is searching for data from 2021 and Redshift knows that the timestamp column for a particular block has a maximum value in 2018, it does not need to read that block from disk to examine its contents. This greatly reduces query time.
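
The zone-map check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Redshift's actual implementation: the blocks are tiny lists of years rather than 1 MB of data, and the names `zone_map` and `blocks_to_read` are made up for this example.

```python
# Three hypothetical blocks of a timestamp column (only the year is kept here).
blocks = [
    [2016, 2017, 2018],
    [2019, 2020],
    [2020, 2021, 2021],
]

# The zone map holds (min, max) per block, so the blocks themselves
# do not have to be read in order to decide whether they are relevant.
zone_map = [(min(b), max(b)) for b in blocks]

def blocks_to_read(zone_map, lo, hi):
    """Return indexes of blocks whose [min, max] range overlaps [lo, hi]."""
    return [
        i for i, (bmin, bmax) in enumerate(zone_map)
        if bmax >= lo and bmin <= hi
    ]

# A query for rows from 2021 only needs the last block.
print(blocks_to_read(zone_map, 2021, 2021))  # [2]
```

Note that the check can only rule blocks out. A block whose range overlaps the filter still has to be read and scanned for matching rows.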

    Within each block, data is stored according to the column's selected Compression Encoding. Compression encodings are techniques for reducing the storage space that data requires. For example, if a column contains a list of countries and the rows are sorted in alphabetical order by country, then Redshift could simply store the fact that the block contains Jamaica x 63, then Japan x 104, then Jordan x 26. This might only require 24 bytes to store 193 rows of data, and don't forget that each block is 1 MB in size, so a single block can hold a very large number of rows. Thus, compression reduces the amount of disk access required to retrieve data, again making queries faster.
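
The country example above amounts to run-length encoding. A minimal Python sketch, assuming a simple list of `(value, count)` pairs as the encoded form (the real on-disk layout is not described here, and `rle_encode` / `rle_decode` are illustrative names):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]

# 193 rows, already sorted by country as in the example above.
column = ["Jamaica"] * 63 + ["Japan"] * 104 + ["Jordan"] * 26

runs = rle_encode(column)
print(runs)  # [('Jamaica', 63), ('Japan', 104), ('Jordan', 26)]
```

This only works well because the rows are sorted: if the same 193 values were interleaved, almost every run would have length 1 and nothing would be saved.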

    To answer your question about how Redshift would find the desired rows:

    • If the SORTKEY is used in a WHERE clause, then Redshift can quickly find the relevant blocks that potentially contain the desired data, because sorted data gives each block a narrow, non-overlapping range of values. I'm not sure if it does that with a binary search.
    • If the WHERE clause does not use the SORTKEY, then finding the right rows is not as efficient, because rows matching the condition are not stored together and may be scattered across many blocks on disk.
    • Even then, Redshift can still skip over any block whose Zone Map shows that it cannot contain matching data, which avoids reading that block from disk. This applies to every column, not just the SORTKEY. Plus, compression on various columns can reduce the number of blocks that need to be read from disk.
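
To see why the points above favor filtering on the SORTKEY, here is a small Python simulation. It is a sketch under simplifying assumptions (10,000 integer values, 100 values per block, hypothetical helper names), not a model of Redshift's real storage. It counts how many blocks a min/max check cannot rule out when the same column is stored sorted and when it is stored in random order.

```python
import random

def build_zone_map(values, block_size):
    """Split a column into fixed-size blocks and record (min, max) per block."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]

def count_blocks_to_read(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps the filter [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)

sorted_values = list(range(10_000))   # column stored in sort order
shuffled_values = sorted_values[:]    # same data, stored in arbitrary order
random.Random(0).shuffle(shuffled_values)

sorted_zone_map = build_zone_map(sorted_values, 100)
unsorted_zone_map = build_zone_map(shuffled_values, 100)

# Filter equivalent to: WHERE col BETWEEN 5000 AND 5099
sorted_hits = count_blocks_to_read(sorted_zone_map, 5000, 5099)
unsorted_hits = count_blocks_to_read(unsorted_zone_map, 5000, 5099)

print("blocks to read when sorted:  ", sorted_hits)
print("blocks to read when unsorted:", unsorted_hits)
```

With sorted data the filter range falls inside a single block. With shuffled data nearly every block's min/max range spans the filter, so the zone map rules out very little.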

    The general rules for Amazon Redshift are:

    • Set the DISTKEY to the column that is most frequently used in JOIN conditions
    • Set the SORTKEY to the column that is most frequently used in WHERE clauses

    See: Tuning query performance - Amazon Redshift