I am trying to insert data from S3 (Parquet files) into Redshift. Through SQLWorkbench it takes 46 seconds for 6 million rows, but through the spark-redshift connector it takes about 7 minutes.
I have also tried it with more nodes and get the same result.
Any suggestions to improve the time when using spark-redshift?
The code in Spark:
val df = spark.read.option("basePath", "s3a://parquet/items").parquet("s3a://parquet/items/Year=2017/Month=7/Day=15")
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:....")
.option("dbtable", "items")
.option("tempdir", "s3a://parquet/temp")
.option("aws_iam_role", "...")
.option("sortkeyspec", "SORTKEY(id)")
.mode(SaveMode.Append)
.save()
The code in SQLWorkbench (Redshift SQL):
CREATE EXTERNAL TABLE items_schema.parquet_items (id type, column2 type, ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS PARQUET
LOCATION 's3://parquet/items/Year=2017/Month=7/Day=15';
CREATE TABLE items (id type, column2 type, ...);
INSERT INTO items (SELECT * FROM items_schema.parquet_items);
I would say your snippets are mislabelled: the real comparison is not SQLWorkbench vs Spark, it is spark-redshift vs Redshift Spectrum:

- val df = spark.read… — Spark reads the Parquet data and the spark-redshift connector loads it into Redshift.
- CREATE EXTERNAL TABLE… — a Redshift Spectrum external table, loaded with plain SQL (here run from SQLWorkbench).
When you use the external table (Redshift Spectrum) it does the following:

- Redshift Spectrum reads the Parquet data directly from S3.
- The INSERT writes it once into the Redshift table.

When you use the Spark code to write the data to Redshift, using spark-redshift, it does the following:

- Spark reads the Parquet files from S3 into the Spark cluster.
- Spark converts the data to an intermediate format and writes it back to S3, into the tempdir location.
- spark-redshift issues a COPY SQL query to Redshift to load the data from tempdir.

Basically the Spark code is doing a lot more work, reading the data twice and writing it twice in different formats. The Redshift Spectrum SQL is reading the data once and writing it once into Redshift itself (much faster than sending it to S3 over the network).
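If you have to stay on the spark-redshift path, one setting worth trying is the connector's tempformat option, which controls the intermediate files written to tempdir before the COPY (AVRO by default; CSV or CSV GZIP usually load faster). A minimal sketch of your write with that option added, assuming the Databricks spark-redshift connector and keeping your placeholder connection values:

import org.apache.spark.sql.SaveMode

// Read only the partition to load, as before.
val df = spark.read
  .option("basePath", "s3a://parquet/items")
  .parquet("s3a://parquet/items/Year=2017/Month=7/Day=15")

// Same write, but with the tempdir format switched from the default AVRO
// to gzipped CSV, which Redshift's COPY generally ingests faster.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:....")                // placeholder JDBC URL, as in your snippet
  .option("dbtable", "items")
  .option("tempdir", "s3a://parquet/temp")
  .option("aws_iam_role", "...")             // placeholder IAM role ARN
  .option("tempformat", "CSV GZIP")          // AVRO (default), CSV, or CSV GZIP
  .option("sortkeyspec", "SORTKEY(id)")
  .mode(SaveMode.Append)
  .save()

Even so, this still reads from S3, writes back to S3, and then COPYs into Redshift, so the Spectrum INSERT … SELECT route will remain much faster for this kind of bulk load.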