I looked for similar examples, but all of them had paths ending in a predictable numeric suffix, so they could construct the file names and loop over them.
My scenario is as follows:
I have multiple Parquet files in multiple partitions, with paths like:
s3a://path/idate=2019-09-16/part-{some random hex key1}.snappy.parquet
s3a://path/idate=2019-09-16/part-{some random hex key2}.snappy.parquet
etc...
The {some random hex key} part is obviously not predictable, so I cannot construct the file names programmatically in a loop.
I would like a for loop along the lines of this pseudocode:

    files = "s3a://path/idate=2019-09-16/"
    for i in files:
        block{i} = spark.read.parquet(i)

where block{i} stands for block1, block2, etc.: one DataFrame per s3a://path/idate=2019-09-16/part-{some random hex key1, key2, etc.}.snappy.parquet file.
Is this even possible?
You can read all of the files under files = "s3a://path/idate=2019-09-16/" into a single DataFrame with:

    df = spark.read.parquet(files)

Pointed at the directory, Spark discovers every part file underneath it, so you never need to know the random hex names.
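
A minimal sketch of that, assuming a standard PySpark session (the source_file column is my own addition, using the built-in input_file_name function to record which part file each row came from):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()

    # Read the whole partition directory; Spark picks up every
    # part-*.snappy.parquet file under it automatically.
    files = "s3a://path/idate=2019-09-16/"
    df = spark.read.parquet(files)

    # Optional: tag each row with the part file it came from.
    df = df.withColumn("source_file", input_file_name())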
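
If you really do want one DataFrame per part file, you can list the directory first and keep the results in a dict instead of generated variable names like block1, block2. A sketch assuming the Hadoop FileSystem API is reachable through PySpark's (internal) JVM gateway; the blocks dict is my own naming:

    # List the part files via Hadoop's FileSystem API, then read each
    # one into its own DataFrame, keyed as block1, block2, ... in a dict.
    sc = spark.sparkContext
    partition = sc._jvm.org.apache.hadoop.fs.Path("s3a://path/idate=2019-09-16/")
    fs = partition.getFileSystem(sc._jsc.hadoopConfiguration())

    blocks = {}
    for i, status in enumerate(fs.listStatus(partition), start=1):
        p = status.getPath().toString()
        if p.endswith(".snappy.parquet"):
            blocks[f"block{i}"] = spark.read.parquet(p)

That said, the single spark.read.parquet(files) call is usually preferable, since Spark already parallelizes the read across the part files.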