Search code examples
amazon-web-servicesamazon-s3amazon-redshiftfilesize

why AWS file size is different between Redshift and S3?


I'm UNLOADing tables from Redshift to S3 for backup. So I am checking to make sure the files are complete if we need them again.

I just did UNLOAD on a table that has size = 1,056 according to:

select "table", size, tbl_rows
FROM svv_table_info;

According to the documentation, the size is "in 1 MB data blocks", so this table is using 1,056 MB. But after copying to S3, the file size is 154 MB (viewing in AWS console).

I copied back to Redshift and all the rows are there, so this has to do with "1 MB data blocks". This is related to how it's saved in the file system, yes?

Can someone please explain? Thank you.


Solution

  • So you're asking why the SVV_TABLE_INFO view claims that the table consumes 1 GB, but when you dump it to disk the result is only 154 MB?

    There are two primary reasons. The first is that you're actively updating the table but not vacuuming it. When a row is updated or deleted, Redshift actually appends a new row (yes, stored as columns) and tombstones the old row. To reclaim this space, you have to regularly vacuum the table. While Redshift will do some vacuuming in the background, this may not be enough, or it may not have happened at the time you're looking.

    The second reason is that there's overhead required to store table data. Each column in a table is stored as a list of 1 MB blocks, one block per slice (and multiple slices per node). Depending on the size of your cluster and the column data type, this may lead to a lot of wasted space.

    For example, if you're storing 32-bit integers, one 1MB block can store 256,000 such integers, requiring a total of 4 blocks to store 1,000,000 values (which is probably close to number of rows in your table). But, if you have a 4-node cluster, with 2 slices per node (ie, a dc2.large), then you'll actually require 8 blocks, because the column will be partitioned across all slices.

    You can see the number of blocks that each column uses in STV_BLOCKLIST.