amazon-redshift sql-insert identity-column csv-import

Redshift Copy command identity column is alternate value due to number of slices

I am trying to achieve sequential incremental values in the identity column of Redshift while running copy command.

Redshift-Identity column SEED-STEP behavior with COPY command is an excellent article I followed to slowly move towards my target, but even after following the last step from the list and using the manifest file, I could only get (alternatively incrementing) 1,3,5,7... or 2,4,6,8... ID column values.

While creating table, I give that column as:

  bucketingid                             INT IDENTITY(1, 1) sortkey

I can understand the behavior is because my dc2.large single node cluster has 2 slices and hence I am getting the issue.

I am trying to upload a single csv file from S3 to redshift.

How can I achieve sequential incremental IDs?

Solution

The IDENTITY column is not guaranteed to produce consecutive values. It guarantees to assign unique and monotonic values.

You can solve your problem with some sql once the data is loaded:

CREATE TABLE my_table_with_consecutive_ids AS 
    SELECT 
       row_number() over (order by bucketingid) as consecutive_bucketingid, 
       *
    FROM my_table

Some explaination why the problem occurs:

Since COPY performs distributed loading of your data, and each file is loaded by a node slice, loading only one file will be handled by a single slice. To be able to guarantee unique values while loading data in parallel by different slices, each of them is using an space of identities exclusive to itself (with 2 slices, one uses odd, and the other one even numbers).

Theoretically, you can have consecutive ids after loading the data if you split the file in two (or whatever is the number of slices your cluster has) and use both slices for loading (you'll need to use MANIFEST file), but it's highly impractical and you also make assumptions about your cluster size.

Same explaination from CREATE TABLE manual:

IDENTITY(seed, step)

... With a COPY operation, the data is loaded in parallel and distributed to the node slices. To be sure that the identity values are unique, Amazon Redshift skips a number of values when creating the identity values. As a result, identity values are unique and sequential, but not consecutive, and the order might not match the order in the source files.