apache-spark · amazon-redshift · parquet

How do I export tables from Redshift into Parquet format?


A couple of options I can think of:

  • Spark streaming back into S3 using the Redshift connector
  • UNLOAD to S3 gzipped, then convert with a command-line tool (rough sketch of what I mean below)

I'm not sure which is better. I'm also not clear on how to easily translate the Redshift schema into something Parquet can ingest, but maybe the Spark connector will take care of that for me.
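
For the UNLOAD route, something like the following is what I had in mind, issued over plain JDBC from Scala; the host, credentials, bucket, and IAM role are all placeholders:

    import java.sql.DriverManager

    // Open a plain JDBC connection to Redshift (placeholder host/credentials)
    val conn = DriverManager.getConnection(
      "jdbc:redshift://redshift-host:5439/dev", "<username>", "<password>")
    val stmt = conn.createStatement()

    // Ask Redshift to dump the table to S3 as gzipped, comma-delimited files
    stmt.execute(
      """UNLOAD ('select * from data_table')
        |TO 's3://my-bucket/unload/data_table_'
        |IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        |GZIP DELIMITER ','""".stripMargin)

    conn.close()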


Solution

  • Get the Redshift JDBC jar and use sparkSession.read.jdbc with the Redshift connection details, like this example:

    // Build the JDBC connection properties for Redshift
    val properties = new java.util.Properties()
    properties.put("driver", "com.amazon.redshift.jdbc42.Driver")
    properties.put("url", "jdbc:redshift://redshift-host:5439/")
    properties.put("user", "<username>")
    properties.put("password", spark.conf.get("spark.jdbc.password", "<default_pass>"))

    // Read the Redshift table into a DataFrame over JDBC
    val d_rs = spark.read.jdbc(properties.get("url").toString, "data_table", properties)
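
    From there, writing Parquet is just a DataFrame write. A minimal sketch, assuming an S3 output path of your choosing:

    // Spark derives the Parquet schema from the DataFrame schema it inferred
    // over JDBC, so no manual mapping of Redshift types is needed.
    // The output path below is a placeholder.
    d_rs.write.mode("overwrite").parquet("s3a://my-bucket/data_table_parquet/")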
    

    My relevant blog post: http://garrens.com/blog/2017/04/09/connecting-apache-spark-to-external-data-sources/

    Spark Streaming should be irrelevant in this case; this is a one-off batch export.

    I would also recommend using the Databricks spark-redshift package, which makes the bulk unload from Redshift and the load into Spark much faster.
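
    A minimal sketch of that connector; the S3 tempdir and IAM role here are placeholders for your own:

    // Bulk read via spark-redshift: it UNLOADs to the tempdir in S3 behind
    // the scenes instead of pulling rows one by one over JDBC.
    val df = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://redshift-host:5439/dev?user=<username>&password=<password>")
      .option("dbtable", "data_table")
      .option("tempdir", "s3a://my-bucket/spark-redshift-temp/")
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-role")
      .load()

    // Then write df out as Parquet exactly as shown above.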