When creating a table on Redshift it's possible to specify (a) column(s) to use as sortkey(s). However, one can also set the sortkey to auto. For this case, Redshift comes with "automatic table optimizations". It looks at queries on the table and decides which sortkey(s) might be appropriate (e.g. a lot of queries select on the date column, so this is then used for sorting). Now, even after reading a lot in the docs and other resources, a simple question remains open: How can I tell, which sortkey Redshift currently uses for my query when auto mode is set on the table?
What I've tried:
Looking at SVV_ALTER_TABLE_RECOMMENDATIONS
. However, this is empty, because any of the changes RS potentially did to that table are long ago (the specific table I'm looking at is not new). Also pg_table_def
doesn't help, as it only indicates the sort columns that are manually specified as such. In SVV_TABLE_INFO
I can only see that my table is set to AUTO(SORTKEY)
, which I already know.
The only alternative I could think of, is to run some test queries using a specific column in the where clause. Then, trying to figure out if the number of scanned lines changes (or maybe compare query time?). But I struggle with this.
First, I'm not sure where to find that information. I thought SVL_QUERY_METRICS_SUMMARY
may help, as there is a scan_row_count
column. But the docs say:
The number of rows in a scan step. The row count is the total number of rows emitted before filtering rows marked for deletion (ghost rows) and before applying user-defined query filters.
So it seems that it doesn't take the query filter into account.
Second, even if this would work, one would need to test several columns and this exploratory approach seems inefficient.
Any other ideas?
One way to attack this is to look at the block metadata for the table and see what columns are in sort order. This assumes that the tables is sorted (vacuumed) and analyzed. STV_BLOCKLIST contains this data - see: https://docs.aws.amazon.com/redshift/latest/dg/r_STV_BLOCKLIST.html
Since this table is very detailed I recommend only looking at one slice and pivot the data so that each column's data is a column. You should see that only one column has non-overlapping max and min values for its metadata.
On a side note: While Redshift can do a fair job of picking a table sort key after some time running actual production queries, it won't fully optimize the table for your query load. It won't make stupid choices and it not a bad place to start if you have no usage info but it likely won't get the most out of your cluster.