I have 25,000 small files on Minio S3 to parse.
from pyspark.sql.functions import input_file_name

# read every matching text file and keep each record's source path
df = spark.read.text("s3a://bucket/*/*/file*.txt").withColumn("path", input_file_name())
# parsing
# writing to parquet
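For reference, the parse-and-write step is roughly this shape (a minimal sketch; the delimiter and output path are placeholders, not my actual parser):

from pyspark.sql.functions import split, col

# hypothetical parsing: split each line into fields, keep the source path
parsed = df.select(split(col("value"), ";").alias("fields"), col("path"))
parsed.write.mode("overwrite").parquet("s3a://bucket/output/")  # placeholder output path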
Parsing and writing to parquet are fast, but listing the files via the S3 API took about 40 minutes. Question: how can I make the listing faster?
I'm using Spark 3.1.1 with Hadoop 3.2.
This, on the other hand, is really fast:
df = (spark.read
      .option("pathGlobFilter", "file*.txt")
      .option("recursiveFileLookup", "true")
      .text("s3a://bucket/")
      .withColumn("path", input_file_name()))