Tags: amazon-web-services, apache-spark, amazon-s3, amazon-emr

Duplicate partition columns on write to S3


I'm processing data and writing it to s3 using the following code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.config('spark.sql.sources.partitionOverwriteMode', 'dynamic').getOrCreate()
df = spark.read.parquet('s3://<some bucket>/<some path>').filter(F.col('processing_hr') == <val>)
transformed_df = do_lots_of_transforms(df)

# here's the important bit on how I'm writing it out
transformed_df.write.mode('overwrite').partitionBy('processing_hr').parquet('s3://bucket_name/location')

Basically, I'm trying to overwrite only the partitions present in the DataFrame, while leaving the previously processed partitions in S3 untouched.
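
As an aside: since Spark 2.4 this mode can, as I understand it, also be requested per write via a DataFrameWriter option, which takes precedence over the session-level config:

transformed_df.write \
    .mode('overwrite') \
    .option('partitionOverwriteMode', 'dynamic') \
    .partitionBy('processing_hr') \
    .parquet('s3://bucket_name/location')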

This write runs repeatedly and intermittently fails, and the failure is silent: the job completes, but when I read the data back in from 's3://bucket_name/location', I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o85.parquet.
: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

    Partition column name list #0: processing_hr, processing_hr
    Partition column name list #1: processing_hr

For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:

    s3://bucket_name/location/processing_hr=2019-09-19 13%3A00%3A00
    s3://bucket_name/location/processing_hr=2019-09-19 20%3A00%3A00
    s3://bucket_name/location/processing_hr=2019-09-19 12%3A00%3A00/processing_hr=2019-09-19 12%3A00%3A00

I'm kind of baffled as to how this could happen. How do I keep Spark from duplicating my partition column?

I've looked through the docs and the Spark JIRA, and still can't find anything even remotely related to this.
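
For reference, here's a rough diagnostic sketch I can run to spot the nested partition directories (bucket and prefix are the placeholders from the paths above; this assumes boto3 credentials are already configured):

import boto3

s3 = boto3.client('s3')
bucket, prefix = 'bucket_name', 'location/'  # placeholders from the paths above

# Flag any key whose path contains the partition column more than once,
# i.e. a partition directory nested inside another partition directory.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        if obj['Key'].count('processing_hr=') > 1:
            print('nested partition dir:', obj['Key'])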


Solution

  • This seems to be caused by S3's eventual consistency. If you're using EMR < 5.30, enable the EMRFS consistent view (see the sketch below); otherwise, upgrade to EMR 5.30, where the latest EMRFS appears to have addressed this issue.
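
If you go the consistent-view route on a pre-5.30 release, it is enabled through the emrfs-site classification at cluster launch. A minimal sketch using boto3 (the cluster name, release label, instance types, and IAM roles below are placeholder assumptions, not part of the original answer):

import boto3

emr = boto3.client('emr')

# The Configurations entry is the relevant part: the emrfs-site
# classification with fs.s3.consistent turns on EMRFS consistent view.
response = emr.run_job_flow(
    Name='spark-partition-write',        # placeholder cluster name
    ReleaseLabel='emr-5.29.0',           # a pre-5.30 release where this applies
    Applications=[{'Name': 'Spark'}],
    Configurations=[{
        'Classification': 'emrfs-site',
        'Properties': {'fs.s3.consistent': 'true'},
    }],
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',   # default EMR roles, assumed to exist
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])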