I want to read all parquet files from an S3 bucket, including all those in the subdirectories (these are actually prefixes).
Using wildcards (*) in the S3 URL only works for files directly under the specified prefix. For example, this code will only read the parquet files directly below the target/ folder:
df = spark.read.parquet("s3://bucket/target/*.parquet")
df.show()
Let's say I have a structure like this in my S3 bucket:
"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"
The above code will raise the exception:
pyspark.sql.utils.AnalysisException: 'Path does not exist: s3://mailswitch-extract-underwr-prod/target/*.parquet;'
How can I read all the parquet files from the subdirectories of my S3 bucket?
To run my code, I am using AWS Glue 2.0 with Spark 2.4 and Python 3.
If you want to read all parquet files below the target folder
"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"
you can do:
df = spark.read.parquet("bucket/target/*/*/*/*.parquet")
The downside is that you need to know the depth of your parquet files.
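If you are able to use Spark 3.0 or later (for example on AWS Glue 3.0+), the recursiveFileLookup reader option avoids having to know the depth at all. A minimal sketch, assuming the same bucket/target layout as above:

# Recursively load every parquet file under the prefix, regardless of nesting
# depth (requires Spark 3.0+; note this option disables partition discovery).
df = spark.read.option("recursiveFileLookup", "true").parquet("s3://bucket/target/")
df.show()

On Spark 2.4 you are limited to globs, but spark.read.parquet accepts multiple paths, so you can pass one glob per depth if the nesting varies.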