
Load partitioned json files from S3 in AWS Glue ETL jobs


I'm trying to load JSON files from S3 that are partitioned like this:

|-json-data
   |-x=something
      |-y=something
         |-data.json

I'm loading them like this in my ETL job:

datasource0 = glueContext.create_dynamic_frame_from_options(
    's3',
    {
        'paths': ['s3://bucket/json-data/'],
        'recurse': True,
        'groupFiles': 'inPartition',
        'partitionKeys': ['x', 'y']
    },
    format='json',
    transformation_ctx='datasource0')

However, when I inspect the schema with datasource0.printSchema(), the partition columns don't appear. I need those partitions in the schema to do my transformations. After some research I'm not sure whether this is a supported feature of create_dynamic_frame_from_options. Does anyone know how to do this?


Solution

  • You can only pass partitionKeys to write_dynamic_frame.from_options, not when reading from S3. To load specific partitions or filter on them, those partitions must already be registered in the source table's metadata.

    So you need to either crawl the data with a Glue crawler or create a partitioned table in Athena. Once the table is available in the Glue Data Catalog, you can load its partitions into your Glue ETL job as shown below:

    # Load only the partitions matching the predicate; the partition
    # columns x and y will appear in the resulting schema.
    datasource0 = glue_context.create_dynamic_frame.from_catalog(
        database = "my_S3_data_set",
        table_name = "catalog_data_table",
        push_down_predicate = my_partition_predicate)
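
    Note that the table has to exist in the catalog before the call above will see any partitions. As a minimal sketch, you could register it with a crawler via boto3; the crawler name, IAM role ARN, database, and bucket path below are placeholders:

    import boto3

    glue = boto3.client('glue')

    # Hypothetical crawler name, role, and path -- substitute your own.
    glue.create_crawler(
        Name='json-data-crawler',
        Role='arn:aws:iam::123456789012:role/MyGlueCrawlerRole',
        DatabaseName='my_S3_data_set',
        Targets={'S3Targets': [{'Path': 's3://bucket/json-data/'}]})

    # The crawler registers x and y as partition columns based on the
    # x=.../y=... folder layout.
    glue.start_crawler(Name='json-data-crawler')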
    

    See the link below for how to leverage predicate pushdown:

    https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
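
    For example, assuming the layout above, the predicate is just a SQL-style expression over the partition columns (the values here are placeholders):

    # Only partitions matching the predicate are read from S3.
    my_partition_predicate = "x = 'something' and y = 'something'"

    datasource0 = glue_context.create_dynamic_frame.from_catalog(
        database = "my_S3_data_set",
        table_name = "catalog_data_table",
        push_down_predicate = my_partition_predicate)

    # x and y now show up in the schema.
    datasource0.printSchema()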