apache-spark, amazon-redshift, amazon-redshift-spectrum

Inserts into Redshift using spark-redshift


I am trying to insert data from S3 (Parquet files) into Redshift. Through SQL Workbench it takes 46 seconds for 6 million rows, but through the spark-redshift connector it takes about 7 minutes.

I have tried it with more nodes and get the same result.

Any suggestions for improving the time when using spark-redshift?

The code in Spark:

import org.apache.spark.sql.SaveMode

val df = spark.read.option("basePath", "s3a://parquet/items").parquet("s3a://parquet/items/Year=2017/Month=7/Day=15")

df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:....")
      .option("dbtable", "items")
      .option("tempdir", "s3a://parquet/temp")
      .option("aws_iam_role", "...")
      .option("sortkeyspec", "SORTKEY(id)")
      .mode(SaveMode.Append)
      .save()

The code in SQLWorkbench (Redshift SQL):

CREATE EXTERNAL TABLE items_schema.parquet_items (id type, column2 type, ...)
STORED AS PARQUET
LOCATION 's3://parquet/items/Year=2017/Month=7/Day=15';

CREATE TABLE items (id type, column2 type, ...);

INSERT INTO items (SELECT * FROM items_schema.parquet_items);

Solution

  • To be clear about which snippet is which:

    • The first (val df = spark.read…) is Spark code.
    • The second (CREATE EXTERNAL TABLE…) is Redshift SQL.

    When you use the external table (Redshift Spectrum), it does the following:

    • Reads the Parquet data at the location defined.
    • Inserts the data into a normal Redshift table, as shown.

    When you use the Spark code to write the data to Redshift via spark-redshift, it does the following:

    • Spark reads the Parquet files from S3 into the Spark cluster.
    • Spark converts the Parquet data to Avro format and writes it to S3 (Avro is the default staging format; see the sketch after this list).
    • Spark issues a COPY SQL query to Redshift to load the data.
    • Redshift loads the Avro data from S3 into the final table.
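
    If you stay on the spark-redshift path, one knob worth trying is the staging format. The connector's README documents a tempformat option ("AVRO" by default, with "CSV" and "CSV GZIP" also supported) and notes that Redshift generally COPYs CSV faster than Avro. A minimal sketch of the same write with that option (connection details elided, as in the question):

    import org.apache.spark.sql.SaveMode

    // Same write as above, but staging gzipped CSV instead of the default Avro.
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:....")
      .option("dbtable", "items")
      .option("tempdir", "s3a://parquet/temp")
      .option("aws_iam_role", "...")
      .option("tempformat", "CSV GZIP") // default is "AVRO"
      .mode(SaveMode.Append)
      .save()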

    Basically, the Spark code does a lot more work: it reads the data twice and writes it twice, in different formats. The Redshift Spectrum SQL reads the data once and writes it once, directly into Redshift itself, which is much faster than shipping the data to S3 over the network.
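
    If the Spectrum route fits your pipeline, you can still drive it from Spark rather than from SQL Workbench, by sending the INSERT … SELECT to Redshift over JDBC so the data never passes through the Spark cluster. A minimal sketch, assuming the external table from the question already exists, the Redshift JDBC driver is on the classpath, and placeholder connection details:

    import java.sql.DriverManager

    // Redshift itself reads the Parquet files via Spectrum and writes the
    // target table once; Spark only issues the statement.
    val conn = DriverManager.getConnection(
      "jdbc:redshift://host:5439/db", "user", "password")
    try {
      val stmt = conn.createStatement()
      stmt.executeUpdate("INSERT INTO items SELECT * FROM items_schema.parquet_items")
      stmt.close()
    } finally {
      conn.close()
    }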