Tags: amazon-web-services, apache-spark, pyspark, apache-spark-sql, aws-glue

Performance issues with a Glue job while reading huge data from Redshift


I am trying to run a query to read data from a Redshift table through an AWS Glue job.

When it reads the data from the Redshift table, it creates a DataFrame with only 1 partition, and the read takes a lot of time (the data is read from Redshift using a complex query with multiple joins).

df = spark.read \
          .format("jdbc") \
          .option("url", url) \
          .option("query", query) \
          .option("numPartitions", 10) \
          .option("fetchsize", "10000") \
          .option("batchsize", "10000") \
          .load()

I set numPartitions to 10, but it still reads all the data into a single partition. I am not sure why this option has no effect on the number of partitions created during the read.

Can anyone help me understand why .option("numPartitions", 10) is not effective, and how we can parallelize the read from Redshift to improve JDBC read performance?

Also, is passing complex queries to spark.read.jdbc() the best approach for reading data from Redshift?

Appreciate any help on this.

Update:

I tried the suggestion given by Bogdan, but it is still not giving any improvement.

[Screenshot: job metrics showing the job running with only 2 executors]

It always runs with 2 executors (1 for the driver and 1 worker node). How do we improve read parallelism? I tried different options, as shown below, and I don't see any improvement. I am looking to read with 10 to 20 executors; any suggestions are greatly appreciated.

DF_options = {
    "connectionName": connection,
    "url": url,
    "sampleQuery": query,
    "user": user,
    "password": pwd,
    "redshiftTmpDir": "s3://bucket/temp/",
    "hashfield": "location",
    "hashexpression": "id",
    "hashpartitions": "5"
}

DF = glueContext.create_dynamic_frame_from_options(
    connection_type="redshift",
    connection_options=DF_options,
    transformation_ctx="DF"
)

Thanks,

Bab


Solution

  • To read a JDBC source in parallel you need to define partitionColumn, lowerBound, and upperBound (see the Spark JDBC documentation). Spark will use this configuration to split the original query into numPartitions queries.

    The query will, very roughly, be split as shown below, and Spark can send the separate queries in parallel if you have enough executors/cores/slots (a PySpark sketch of this configuration follows the SQL example).

    -- Original query:
    SELECT *
    FROM my_table
    
    -- Queries with:
    --  partitionColumn = 'my_date'
    --  lowerBound = '2021-01-01'
    --  upperBound = '2021-01-02'
    --  numPartitions = 3
    
    SELECT *
    FROM my_table
    WHERE my_date < '2021-01-01'
    
    SELECT *
    FROM my_table
    WHERE my_date >= '2021-01-01'
      AND my_date < '2021-01-02'
    
    SELECT *
    FROM my_table
    WHERE my_date >= '2021-01-02'
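
    For reference, here is a minimal PySpark sketch of that configuration. The column name, bounds, and the url/user/pwd variables are illustrative placeholders mirroring the example above, not values from the question.

    # Sketch only: parallel JDBC read driven by partitionColumn/lowerBound/upperBound.
    # url, user and pwd are assumed to hold the Redshift connection details.
    # Note: Spark's JDBC source does not allow "query" together with "partitionColumn",
    # so a complex query can instead be wrapped as a subquery passed via "dbtable".
    df = (
        spark.read.format("jdbc")
        .option("url", url)
        .option("user", user)
        .option("password", pwd)
        .option("dbtable", "(SELECT * FROM my_table) AS src")  # or simply "my_table"
        .option("partitionColumn", "my_date")
        .option("lowerBound", "2021-01-01")
        .option("upperBound", "2021-01-02")
        .option("numPartitions", 3)
        .option("fetchsize", "10000")
        .load()
    )

    # df.rdd.getNumPartitions() should now report 3, and the three range queries
    # can run concurrently given enough executor slots.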