I have data which I am crawling into AWS Glue. There I am using PySpark and converting it to Parquet format. My original data is CSV and looks something like this:
id, date, data
1, 202003, x
2, 202003, y
1, 202004, z
etc...
I am able to convert the data successfully, but I am unsure of the best way to get the desired output. The output should be split by id and date in S3, so it should look something like:
s3://bucket/outputdata/{id}_{date}/{data}.parquet
Where id and date are the actual id and date values in the data. The names of the files within obviously do not matter; I just want to be able to create "folders" in the S3 object prefix and split the data within them.
I am very new to AWS Glue and I have a feeling I am missing something very obvious.
Thanks in advance.
You can create a partition column by concatenating your two existing columns and then partitioning by the new column on write, e.g.:
from pyspark.sql.functions import concat, col, lit

# Build a combined partition column "p" with values like "1_202003"
df1 = df.withColumn('p', concat(col('id'), lit('_'), col('date')))

# Write Parquet files into one folder per distinct value of "p"
df1.write.partitionBy('p').parquet('s3://bucket/outputdata')
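Note that Spark writes Hive-style partition folders, so the prefixes will come out as s3://bucket/outputdata/p=1_202003/part-....parquet rather than exactly {id}_{date}/. If nested folders are acceptable instead of the concatenated name, a sketch of the alternative (using the same df as above) is to partition by both columns directly, which produces id=1/date=202003/ style prefixes:

# Alternative: nested partition folders per id and date
df.write.partitionBy('id', 'date').parquet('s3://bucket/outputdata')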