apache-spark · amazon-s3 · pyspark · aws-glue

How to partition S3 output files by a combination of column values?


I have data which I am crawling into AWS Glue. There I am using PySpark to convert it to Parquet format. My original data is CSV and looks something like this:

id, date, data
1, 202003, x
2, 202003, y
1, 202004, z
etc...

I am able to convert the data successfully, but I am unsure of the best way to get the desired output. The output should be split by id and date in S3, so it should look something like:

s3://bucket/outputdata/{id}_{date}/{data}.parquet

Where id and date are the actual id and date values in the data. The names of the files within obviously do not matter; I just want to be able to create "folders" in the S3 object prefix and split the data among them.

I am very new to AWS Glue and I have a feeling I am missing something very obvious.

Thanks in advance.


Solution

  • You can create a partition column by concatenating your two existing columns and then partitioning by the new column on write, e.g.:

    from pyspark.sql.functions import concat, col, lit

    # Build a single partition column from id and date, e.g. "1_202003"
    df1 = df.withColumn('p', concat(col('id'), lit('_'), col('date')))
    # Write one S3 prefix per distinct value of the partition column
    df1.write.partitionBy('p').parquet('s3://bucket/outputdata')
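
  • Note that partitionBy writes Hive-style prefixes, so the output will look like s3://bucket/outputdata/p=1_202003/part-....parquet rather than a bare 1_202003/ folder; the value of p is encoded in the path and dropped from the data files themselves. If you would rather write through Glue's DynamicFrame sink instead of the plain Spark writer, the equivalent looks roughly like the sketch below (this assumes a GlueContext named glueContext is already available in the job and that df1 is the DataFrame from above; the names are illustrative):

    from awsglue.dynamicframe import DynamicFrame

    # Convert the DataFrame (including the derived 'p' column) to a DynamicFrame
    dyf = DynamicFrame.fromDF(df1, glueContext, 'dyf')

    # 'partitionKeys' tells the Glue S3 sink to split the output by column value
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type='s3',
        connection_options={'path': 's3://bucket/outputdata', 'partitionKeys': ['p']},
        format='parquet',
    )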