I'm trying to use Spectrify to unload data from Redshift to S3 in Parquet format, but I'm stuck because I can't understand a few things. Spectrify's documentation isn't great and I can't find any implementation examples online. I also found a similar question here on Stack Overflow, but the accepted answer simply recommends using Spectrify, which doesn't help much.
Here is the problem (this is the code from their documentation):
from datetime import date

from spectrify.export import RedshiftDataExporter
from spectrify.convert import ConcurrentManifestConverter
from spectrify.utils.s3 import SimpleS3Config
from spectrify.utils.schema import SqlAlchemySchemaReader

# Build the S3 locations first, so s3_config exists before it is used
start_date = date(2018, 1, 1)  # partition date for the export
csv_path_template = 's3://my-bucket/my-table/csv/{start.year}/{start.month:02d}/{start.day:02d}'
spectrum_path_template = 's3://my-bucket/my-table/spectrum/partition_key={start}'
csv_path = csv_path_template.format(start=start_date)
spectrum_path = spectrum_path_template.format(start=start_date)
s3_config = SimpleS3Config(csv_path, spectrum_path)

# Unload the table to CSVs in S3 (sa_engine is a SQLAlchemy engine for Redshift)
RedshiftDataExporter(sa_engine, s3_config).export_to_csv('my_table')

# Read the table schema, then convert the exported CSVs to Parquet
sa_table = SqlAlchemySchemaReader(sa_engine).get_table_schema('my_table')
ConcurrentManifestConverter(sa_table, s3_config).convert_manifest()
RedshiftDataExporter is used to export data to CSV, and sa_engine is a SQLAlchemy connection to the Redshift cluster.
Their documentation is vague on the conversion process. What is the process for unloading data to CSV and then converting it to Parquet with Spectrify in a Python 3.x script? How should I modify the above code, and what am I missing?
You can now unload Redshift data to S3 in Parquet format without any third-party application; the feature is supported natively in Redshift:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
FORMAT PARQUET
Documentation can be found at UNLOAD - Amazon Redshift
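Since the question asks for a Python 3.x script, the UNLOAD statement can be issued from Python with any Redshift-compatible driver. A minimal sketch, assuming psycopg2 and with a placeholder bucket, table, and IAM role (substitute your own):

```python
# Sketch: run UNLOAD ... FORMAT PARQUET from Python.
# Bucket, table, and IAM role ARN below are placeholders, not real values.

def build_unload_statement(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build an UNLOAD statement that writes Parquet files to S3."""
    # Single quotes inside the inner SELECT must be doubled for UNLOAD.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT PARQUET"
    )

sql = build_unload_statement(
    "SELECT * FROM my_table WHERE created_at >= '2019-01-01'",
    "s3://my-bucket/my-table/parquet/",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)

# Against a live cluster you would then execute it, e.g. with psycopg2:
# import psycopg2
# with psycopg2.connect(host=..., port=5439, dbname=..., user=..., password=...) as conn:
#     with conn.cursor() as cur:
#         cur.execute(sql)
```

The IAM role must have permission to write to the target bucket; Redshift writes one or more Parquet files under the given prefix.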