amazon-web-services pyspark apache-spark-sql aws-glue aws-glue-spark

AWS Glue Python Job not creating new Data Catalog partitions

I created an AWS Glue job using Glue Studio. It reads data from a Glue Data Catalog table, applies some transformations, and writes to a different Data Catalog table.

When configuring the target node, I enabled the option to create new partitions in the Data Catalog after the job runs.

The job runs successfully and the data is written to S3 with the proper partition folder structure, but no new partitions are created in the Data Catalog table. I still have to run a Glue Crawler to create them.

The part of the generated script responsible for partition creation is the final two statements of the job:

DataSink0 = glueContext.write_dynamic_frame.from_catalog(
    frame=Transform4,
    database="tick_test",
    table_name="test_obj",
    transformation_ctx="DataSink0",
    additional_options={
        "updateBehavior": "LOG",
        "partitionKeys": ["date", "provider"],
        "enableUpdateCatalog": True,
    },
)
job.commit()

What am I doing wrong? Why are new partitions not being created? How do I avoid having to run a crawler to have the data available in Athena?

I am using Glue 2.0 (PySpark 2.4).

Solution

As the documentation points out, adding new partitions to the Data Catalog from an ETL job comes with restrictions. Make sure your use case does not violate any of the following:

Only Amazon Simple Storage Service (Amazon S3) targets are supported.

Only the following formats are supported: json, csv, avro, and parquet.

To create or update tables with the parquet classification, you must utilize the AWS Glue optimized parquet writer for DynamicFrames.

When the updateBehavior is set to LOG, new partitions will be added only if the DynamicFrame schema is equivalent to or contains a subset of the columns defined in the Data Catalog table's schema.

The partitionKeys you pass in your ETL script must match the partitionKeys in your Data Catalog table schema, and they must be in the same order.
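To make the restrictions above concrete, here is a rough sketch rather than a definitive fix. The two helper functions are hypothetical names of my own, not part of the Glue API. The first checks the partition key order and the LOG schema rule against a dict shaped like the "Table" entry returned by boto3's glue get_table call (I am assuming the partition columns count as part of the table schema for the subset check). The second writes through getSink with the "glueparquet" format, which is how the documentation describes selecting the optimized parquet writer. Whether your table is actually parquet is an assumption on my part, so pass a different fmt if it is not.

```python
def catalog_update_problems(table, frame_columns, partition_keys):
    """Return reasons the catalog update may be skipped (empty list if none found).

    `table` is a dict shaped like the "Table" entry of a Glue get_table response.
    Hypothetical helper for illustration; names are compared exactly as given.
    """
    problems = []

    # Partition keys must match the catalog table, in the same order.
    catalog_keys = [k["Name"] for k in table.get("PartitionKeys", [])]
    if list(partition_keys) != catalog_keys:
        problems.append(
            f"partitionKeys {list(partition_keys)} do not match catalog {catalog_keys}"
        )

    # With updateBehavior=LOG the frame must not add columns the table lacks.
    # Assumption: partition columns count as part of the table schema here.
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    known = {c["Name"] for c in columns} | set(catalog_keys)
    extra = [c for c in frame_columns if c not in known]
    if extra:
        problems.append(f"columns missing from catalog table: {extra}")

    return problems


def write_with_catalog_update(glue_context, frame, s3_path, database,
                              table_name, partition_keys, fmt="glueparquet"):
    """Write `frame` to S3 and ask Glue to register new partitions.

    Hypothetical wrapper around the documented getSink pattern.
    """
    sink = glue_context.getSink(
        connection_type="s3",
        path=s3_path,
        enableUpdateCatalog=True,
        updateBehavior="LOG",
        partitionKeys=list(partition_keys),
    )
    sink.setFormat(fmt)  # "glueparquet" selects the Glue optimized parquet writer
    sink.setCatalogInfo(catalogDatabase=database, catalogTableName=table_name)
    sink.writeFrame(frame)
```

In the job from the question, the call would look roughly like write_with_catalog_update(glueContext, Transform4, "s3://your-bucket/your-prefix/", "tick_test", "test_obj", ["date", "provider"]), where the S3 path is a placeholder for the location of the target table. The table dict for the check can be fetched with boto3.client("glue").get_table(DatabaseName="tick_test", Name="test_obj")["Table"].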