apache-spark amazon-s3 wildcard parquet

How to read parquet files in pyspark from s3 bucket whose path is partially unpredictable?


My paths are of the format s3://my-bucket/timestamp=yyyy-mm-dd HH:MM:SS/.

E.g. s3://my-bucket/timestamp=2021-12-12 12:19:27/. However, the MM:SS part is not predictable, and I am interested in reading the data for a given hour. I tried the following:

  1. df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:*:*/")
  2. df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:[00,01-59]:[00,01-59]/")

but they give the error pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException.


Solution

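  • The problem is that your path contains colons (:). Unfortunately, Hadoop's path handling, which Spark relies on to resolve these paths, still does not support them.

    I think the only way is to rename these files so that the timestamp no longer contains colons, e.g. timestamp=2021-12-12 12-19-27.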
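
A minimal sketch of such a rename, in case it helps. It assumes the objects can be renamed from outside Spark with a boto3 S3 client (S3 has no rename operation, so each object is copied and then deleted), and that replacing the colons with dashes is acceptable. The helper names colon_free_key and rename_objects are hypothetical, not part of any library.

```python
import re

# Matches the "HH:MM:SS" tail of a "timestamp=yyyy-mm-dd HH:MM:SS" partition name.
_TIMESTAMP = re.compile(r"(timestamp=\d{4}-\d{2}-\d{2} \d{2}):(\d{2}):(\d{2})")


def colon_free_key(key: str) -> str:
    """Return the key with the colons in the timestamp replaced by dashes."""
    return _TIMESTAMP.sub(r"\1-\2-\3", key)


def rename_objects(s3_client, bucket: str, keys):
    """Copy each object to its colon-free key, then delete the original.

    s3_client is assumed to be a boto3 S3 client, i.e. boto3.client("s3").
    Returns the list of (old_key, new_key) pairs that were renamed.
    """
    renamed = []
    for key in keys:
        new_key = colon_free_key(key)
        if new_key == key:
            continue  # nothing to rename
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        renamed.append((key, new_key))
    return renamed
```

After a rename like this, a glob such as spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12-*/") should no longer run into the colon problem. Note that copy_object only handles objects up to 5 GB; larger files would need a multipart copy.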