Tags: amazon-web-services, pyspark, databricks, aws-glue, aws-databricks

AWS Glue: Huge Databricks JDBC dataset and PySpark parallelization


I'm using the Databricks JDBC driver to pull data from Databricks through AWS Glue. The query returns 45M rows.

I'm using a DynamicFrame to read the data and to write it out as a single Parquet file on S3.

The reading part seems to work: I can see the query being executed and succeeding on the Databricks side.

On the AWS Glue side, the metrics show that after the ~2 minutes it takes to run the query on Databricks, the Glue cluster starts using CPU and memory. The weird part is that only one executor is working while the others seem to be idle.

The process usually ends in a timeout, but it sometimes works "magically"; that has only happened twice.

The final file is only 450 MB and contains the complete data, so I don't understand why the job fails for such a small output.

You'll find my code below:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

# Get arguments passed from AWS Glue
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Create a GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())

# Define database and table names
database_name = "my_database"
table_name = "my_table"

# Define SQL query
query = f"SELECT * FROM {database_name}.{table_name}"

# Read data from Databricks using SQL query and create a DynamicFrame
dynamic_frame = glueContext.create_dynamic_frame_from_options(
    connection_type="jdbc",
    connection_options={
        "url": "jdbc:databricks://<databricks-instance>",
        "dbtable": f"({query}) t",
        "user": "<databricks-user>",
        "password": "<databricks-password>",
        "driver": "com.databricks.driver.jdbc.Driver"
    }
)

# Write the DynamicFrame to S3
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={
        "path": "s3://<s3-bucket>/<s3-prefix>"
    },
    format="glueparquet"
)

P.S.: I can't reduce the output of my query any further, as I need all the rows.

Thanks!


Solution

  • This is standard behaviour of the JDBC data source in Spark (I'm not sure about Glue, but it should be the same since it wraps Spark): by default it reads everything through a single connection on a single core, no matter how much data there is. The standard fix is to specify additional options, namely numPartitions together with partitionColumn/lowerBound/upperBound, so the source data is split into multiple queries that are executed in parallel. This is well described in the Databricks documentation about using JDBC with Spark; a sketch of such a partitioned read follows below.
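
    As a minimal sketch: one way to apply these options in a Glue job is to read through Spark's own JDBC reader and then convert the result to a DynamicFrame. It assumes the table has a roughly evenly distributed numeric column (the "id" column, bounds, and partition count below are placeholders to adapt to your data).

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame

    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session

    # "id" is a hypothetical numeric, roughly uniformly distributed column;
    # lowerBound/upperBound should match its actual MIN/MAX in the table.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:databricks://<databricks-instance>")
        .option("dbtable", "my_database.my_table")
        .option("user", "<databricks-user>")
        .option("password", "<databricks-password>")
        .option("driver", "com.databricks.client.jdbc.Driver")
        .option("partitionColumn", "id")     # column used to split the read
        .option("lowerBound", "1")           # smallest value of the column
        .option("upperBound", "45000000")    # largest value of the column
        .option("numPartitions", "16")       # 16 parallel range queries
        .load()
    )

    # Convert back to a DynamicFrame so the rest of the Glue job stays the same.
    dynamic_frame = DynamicFrame.fromDF(df, glueContext, "dynamic_frame")

    With a setup like this, the read is issued as 16 range queries that run as separate tasks, so the other executors in the Glue cluster can participate instead of a single executor fetching all 45M rows.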