Tags: python, amazon-web-services, amazon-s3, parquet

Using Spectrify to unload data from Redshift to S3 in Parquet format


I'm trying to use Spectrify to unload data from Redshift to S3 in Parquet format, but I'm stuck because I can't understand a few things. The Spectrify documentation isn't great, and I can't find any implementation examples online. I also found a similar question here on Stack Overflow, but the accepted answer simply recommends using Spectrify, which doesn't help much.

Here is the problem (this code is based on their documentation):

from spectrify.export import RedshiftDataExporter
from spectrify.convert import ConcurrentManifestConverter
from spectrify.utils.s3 import SimpleS3Config
from spectrify.utils.schema import SqlAlchemySchemaReader
from datetime import date

# The partition date for this export (used to build the S3 paths below)
start_date = date.today()

# S3 locations for the intermediate CSV files and the final Parquet
# (Spectrum) files
csv_path_template = 's3://my-bucket/my-table/csv/{start.year}/{start.month:02d}/{start.day:02d}'
spectrum_path_template = 's3://my-bucket/my-table/spectrum/partition_key={start}'
csv_path = csv_path_template.format(start=start_date)
spectrum_path = spectrum_path_template.format(start=start_date)
s3_config = SimpleS3Config(csv_path, spectrum_path)

# Export the table to CSV files on S3 (sa_engine is a SQLAlchemy engine
# connected to Redshift)
RedshiftDataExporter(sa_engine, s3_config).export_to_csv('my_table')

# Read the table schema, then convert the exported CSVs to Parquet
sa_table = SqlAlchemySchemaReader(sa_engine).get_table_schema('my_table')
ConcurrentManifestConverter(sa_table, s3_config).convert_manifest()


RedshiftDataExporter exports the table to CSV; sa_engine is a SQLAlchemy engine connected to Redshift.

Their documentation is vague about the conversion step. What is the process for unloading the data to CSV and then converting it to Parquet with Spectrify in a Python 3.x script? How should I modify the code above, and what am I missing?


Solution

  • You can now unload Redshift data to S3 in Parquet format without any third-party tool; the feature is natively supported in Redshift:

    UNLOAD ('select-statement')
    TO 's3://object-path/name-prefix'
    FORMAT PARQUET
    

    Documentation can be found at UNLOAD - Amazon Redshift
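
    For reference, here is a minimal sketch of running that UNLOAD from a Python 3.x script with SQLAlchemy, matching the sa_engine setup in the question. The cluster endpoint, credentials, bucket path, and IAM role ARN below are placeholders; substitute your own.

    import sqlalchemy as sa

    # Placeholders: substitute your own cluster endpoint, credentials,
    # bucket path, and IAM role ARN.
    engine = sa.create_engine(
        'postgresql+psycopg2://user:password@my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb'
    )

    unload_sql = sa.text("""
        UNLOAD ('SELECT * FROM my_table')
        TO 's3://my-bucket/my-table/parquet/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT PARQUET
    """)

    # engine.begin() opens a transaction and commits on success,
    # so the UNLOAD actually takes effect.
    with engine.begin() as conn:
        conn.execute(unload_sql)

    Note that the SELECT statement is passed to UNLOAD as a single-quoted string, so any single quotes inside the query itself must be escaped by doubling them.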