apache-spark, amazon-s3, pyspark, minio

How to speed up reading (listing) 25k small files from S3 with Spark


I have 25,000 small files to parse on MinIO (S3-compatible storage):

from pyspark.sql.functions import input_file_name

df = spark.read.text("s3a://bucket/*/*/file*.txt").withColumn("path", input_file_name())
# parsing
# writing to parquet

Parsing and writing to Parquet are fast, but listing the files through the S3 API takes about 40 minutes. How can I make the listing faster?

I am using Spark 3.1.1 with Hadoop 3.2.


Solution

  • This is really fast: instead of expanding the multi-level glob directory by directory, let Spark recursively list everything under the bucket and filter the file names with pathGlobFilter:

        df = (spark.read
              .option("pathGlobFilter", "file*.txt")
              .option("recursiveFileLookup", "true")
              .text("s3a://bucket/")
              .withColumn("path", input_file_name()))