Search code examples
pythonsortingamazon-redshiftpandas-to-sql

Sorting and loading data from Pandas to Redshift using to_sql


I've built some tools that create front-end list boxes for users that reference dynamic Redshift tables. New items in the table, they appear automatically in the list.

I want to put the list in alphabetical order in the database so the dynamic list boxes will show the data in that order.

After downloading the list from an API, I attempt to sort the list alphabetically in a Pandas dataframe before uploading. This works perfectly:

df.sort_values(['name'], inplace=True, ascending=True, kind='heapsort')

But then when I try to upload to Redshift in that order, it loses the order while it uploads. The data appears in chunks of alphabetically ordered segments.

db_conn = create_engine('<redshift connection>')

obj.to_sql('table_name', db_conn, index = False, if_exists = 'replace')

Because of the way the third party tool (Alteryx) works, I need to have this data in alphabetical order in the database.

How can I modify to_sql to properly upload the data in order?


Solution

  • While ingesting data into redshift, data gets distributed between slices on each node in your redshift cluster.
    My suggestion would be to create a sort key on a column which you need to be sorted. Once you have sort key on that column, you can run Vacuum command to get your data sorted.
    Sorry! I cannot be of much help on Python/Pandas

    If I’ve made a bad assumption please comment and I’ll refocus my answer.