
How can I read multiple S3 buckets using Glue?


When using Spark, I can read data from multiple paths by using a `*` wildcard in the prefix. For example, my folder structure is as follows:

s3://bucket/folder/computation_date=2020-11-01/
s3://bucket/folder/computation_date=2020-11-02/
s3://bucket/folder/computation_date=2020-11-03/
etc.

Using PySpark, if I want to read all data for month 11, I can do:

input_bucket = "MY-BUCKET"
input_prefix = "MY-FOLDER/computation_date=2020-11-*"
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))

How do I achieve the same functionality with Glue? The following does not seem to work:

input_bucket = "MY-BUCKET"
input_prefix = "MY-FOLDER/computation_date=2020-11-*"
df_glue = glueContext.create_dynamic_frame_from_options(
            connection_type="s3",
            connection_options={
                "paths": ["s3://{}/{}/".format(input_bucket, input_prefix)]
            },
            format="parquet",
            transformation_ctx="df_glue")
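A possible Glue-side workaround (a sketch, not from the original answer: it assumes the daily partitions all exist and that Glue's `paths` option does not expand wildcards, but does accept a list of explicit paths): enumerate each partition path for the month and pass the whole list in `connection_options`. The helper name `month_partition_paths` is hypothetical.

```python
from datetime import date, timedelta

def month_partition_paths(bucket, folder, year, month):
    """Build an explicit S3 path for every day of the given month.

    Hypothetical helper: instead of a wildcard, enumerate each
    daily partition path (assumes all daily partitions exist).
    """
    d = date(year, month, 1)
    paths = []
    while d.month == month:
        paths.append(
            "s3://{}/{}/computation_date={}/".format(bucket, folder, d.isoformat())
        )
        d += timedelta(days=1)
    return paths

paths = month_partition_paths("MY-BUCKET", "MY-FOLDER", 2020, 11)
# paths[0] == "s3://MY-BUCKET/MY-FOLDER/computation_date=2020-11-01/"

# The list can then be handed to Glue, since "paths" takes multiple entries:
# df_glue = glueContext.create_dynamic_frame_from_options(
#     connection_type="s3",
#     connection_options={"paths": paths},
#     format="parquet",
#     transformation_ctx="df_glue")
```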

Solution

  • I read the files using Spark instead of Glue:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session

    input_bucket = "MY-BUCKET"
    input_prefix = "MY-FOLDER/computation_date=2020-11-*"
    df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))