apache-spark-sql, parquet, pyspark

Loading multiple parquet files iteratively using pyspark


I looked for similar examples, but in all of them the paths ended in a predictable number, so a for loop could simply iterate over them. My scenario is different: I have multiple parquet files across multiple partitions, with paths like

s3a://path/idate=2019-09-16/part-{some random hex key1}.snappy.parquet
s3a://path/idate=2019-09-16/part-{some random hex key2}.snappy.parquet

and so on. The {some random hex key} part is obviously not predictable, so I cannot construct the paths in an iterative rule. I would like a for loop roughly like this:

files = "s3a://path/idate=2019-09-16/"
for i in files:
    block{i} = spark.read.parquet(i)

where block{i} would be block1, block2, etc., i.e. the DataFrames created iteratively from s3a://path/idate=2019-09-16/part-{some random hex key1, key2, etc.}.snappy.parquet.

Is this even possible?


Solution

  • You can read all the files in files="s3a://path/idate=2019-09-16/" using df = spark.read.parquet(files).
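If a single DataFrame over the whole partition is what you need, a minimal sketch could look like the following (the bucket path and date are placeholders taken from the question, and input_file_name is shown only as a way to keep track of each row's source file):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("read-partition").getOrCreate()

# Placeholder prefix from the question; Spark picks up every
# part-*.snappy.parquet object under it, whatever the hex suffixes are.
files = "s3a://path/idate=2019-09-16/"
df = spark.read.parquet(files)

# Optional: record which physical file each row came from, which often
# removes the need for a separate DataFrame per file.
df = df.withColumn("source_file", input_file_name())
df.show(5)

If you genuinely need one DataFrame per file, the idiomatic route is to list the files first and store the DataFrames in a dict keyed by path, rather than trying to generate variable names like block1, block2. A hedged sketch, assuming the Hadoop FileSystem API is reachable through Spark's JVM gateway (spark._jvm and spark._jsc are internal handles, so treat this as illustrative rather than part of the original answer):

path = "s3a://path/idate=2019-09-16/"
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    jvm.java.net.URI.create(path), spark._jsc.hadoopConfiguration()
)

# Collect one DataFrame per parquet file, keyed by its full path.
blocks = {}
for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(path)):
    file_path = status.getPath().toString()
    if file_path.endswith(".snappy.parquet"):
        blocks[file_path] = spark.read.parquet(file_path)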