python, amazon-s3, pyspark

Read multiple files from S3 into PySpark dataframe


Let's say I have some files in an S3 bucket, similar to these:

s3://data-raw-dev/GoogleAds/ad_group/cron_name=customer2/year=2024/month=02/googleads_ad_group_customer2_2024-02-20.parquet
s3://data-raw-dev/GoogleAds/ad_group/cron_name=customer2/year=2024/month=02/googleads_ad_group_customer2_2024-02-19.parquet
s3://data-raw-dev/GoogleAds/ad_group/cron_name=customer3/year=2024/month=02/googleads_ad_group_customer3_2024-02-20.parquet
s3://data-raw-dev/GoogleAds/ad_group/cron_name=customer3/year=2024/month=02/googleads_ad_group_customer3_2024-02-19.parquet

Reading a single parquet file into a PySpark dataframe is fairly straightforward:

# s3_path is the full S3 URI of one parquet file, e.g. the first path above
df_staging = spark.read.parquet(s3_path)
df_staging.show()

I need to read multiple files into a PySpark dataframe based on the date in the file name. Without looping through customer names and reading file by file, how can I read all of the files that have, for example, a date of 2024-02-19 in their names?


Solution

  • Thank you for your answers. I managed to get it working by using: s3://data-raw-dev/GoogleAds/ad_group/**/{year=2024/month=02}/*2_2024-02-19.parquet
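
For reference, here is a minimal sketch of how that kind of glob can be plugged into spark.read.parquet. It assumes the directory layout shown above and standard Hadoop glob semantics (one wildcard per path segment); the bucket, partition names, and date come from the question, while the app name is arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-by-date").getOrCreate()

# One wildcard per partition directory: cron_name=* spans all customers,
# and the trailing wildcard keeps only files dated 2024-02-19.
glob_path = (
    "s3://data-raw-dev/GoogleAds/ad_group/"
    "cron_name=*/year=2024/month=02/*_2024-02-19.parquet"
)

df_staging = spark.read.parquet(glob_path)
df_staging.show()

On Spark 3.0+, the same filtering can also be expressed with the pathGlobFilter and recursiveFileLookup read options instead of a glob in the path. Note that recursiveFileLookup disables partition discovery, so cron_name, year, and month would no longer appear as inferred columns:

# Alternative: let Spark walk the whole tree and filter on file names only.
df_alt = (
    spark.read
    .option("recursiveFileLookup", "true")
    .option("pathGlobFilter", "*_2024-02-19.parquet")
    .parquet("s3://data-raw-dev/GoogleAds/ad_group/")
)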