apache-spark, parquet, aws-glue, pyspark

Reading data from s3 subdirectories in PySpark


I want to read all parquet files from an S3 bucket, including all those in the subdirectories (these are actually prefixes).

Using wildcards (*) in the S3 URL only matches files directly under the specified prefix. For example, the following code only reads parquet files that sit immediately below the target/ folder.

df = spark.read.parquet("s3://bucket/target/*.parquet")
df.show()

Let's say I have a structure like this in my S3 bucket:

"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"

The above code will raise the exception:

pyspark.sql.utils.AnalysisException: 'Path does not exist: s3://mailswitch-extract-underwr-prod/target/*.parquet;'

How can I read all the parquet files from the subdirectories from my s3 bucket?

To run my code, I am using AWS Glue 2.0 with Spark 2.4 and Python 3.


Solution

  • If you want to read all parquet files below the target folder

    "s3://bucket/target/2020/01/01/some-file.parquet"
    "s3://bucket/target/2020/01/02/some-file.parquet"
    

    you can do

    df = spark.read.parquet("bucket/target/*/*/*/*.parquet")
    

    The downside is that you need to know how deeply your parquet files are nested.
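
  • If your environment runs Spark 3.0 or later (for example AWS Glue 3.0+), the recursiveFileLookup reader option avoids having to know the nesting depth. A minimal sketch, assuming a SparkSession named spark and the same bucket layout as above:

    # Spark 3.0+ only: recursively pick up every parquet file under target/
    df = (
        spark.read
        .option("recursiveFileLookup", "true")
        .parquet("s3://bucket/target/")
    )
    df.show()

    Note that enabling recursiveFileLookup disables partition discovery, which does not matter here because the date folders are plain prefixes rather than key=value partitions.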