I have a folder (path = mnt/data/*.json) in S3 with millions of JSON files (each file is less than 10 KB). I run the following code:
df = (spark.read
      .option("multiLine", True)
      .option("inferSchema", False)
      .json(path))
display(df)
The problem is that it is very slow. Spark creates a single job for this with one task. The task appears to have no executors running it anymore, which usually signals that the job is finished (right?), but for some reason the command cell in Databricks is still running. It has been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.
Notes to consider:
My approach ended up being very simple, thanks to Anand pointing out the "small file problem." My problem was that I could not ingest ~2 million JSON files, each ~10 KB in size, so there was no way to read them and store them in Parquet format as an intermediary step. I was given an S3 bucket of raw JSON files scraped from the web.
At any rate, Python's zipfile module came in handy. I used it to combine the small JSON files into archives that were each at least 128 MB and at most 1 GB. Worked pretty well!
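Roughly, the batching looked like the sketch below. The directory names and the exact roll-over logic are illustrative, not my production script; the 128 MB / 1 GB bounds are the ones mentioned above.

import zipfile
from pathlib import Path

SRC_DIR = Path("json_dump")      # local copy of the raw JSON files (illustrative path)
OUT_DIR = Path("json_batches")   # where the combined archives go (illustrative path)
MIN_SIZE = 128 * 1024 * 1024     # lower bound per archive: 128 MB
MAX_SIZE = 1024 * 1024 * 1024    # upper bound per archive: 1 GB

OUT_DIR.mkdir(exist_ok=True)

batch_idx = 0
current_size = 0
# ZIP_STORED (no compression) keeps archive sizes predictable.
zf = zipfile.ZipFile(OUT_DIR / f"batch_{batch_idx:05d}.zip", "w", zipfile.ZIP_STORED)

for src in sorted(SRC_DIR.glob("*.json")):
    size = src.stat().st_size
    # Start a new archive once the current one reaches 128 MB,
    # or if adding this file would push it past 1 GB.
    if current_size >= MIN_SIZE or current_size + size > MAX_SIZE:
        zf.close()
        batch_idx += 1
        current_size = 0
        zf = zipfile.ZipFile(OUT_DIR / f"batch_{batch_idx:05d}.zip", "w", zipfile.ZIP_STORED)
    zf.write(src, arcname=src.name)
    current_size += size

zf.close()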
There is also another way to do this using AWS Glue. That requires an IAM role and can get expensive, but the advantage is that you can convert the files into Parquet directly.
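A minimal sketch of such a Glue job, assuming the standard Glue PySpark boilerplate; the bucket paths and the grouping size are placeholders, not values I have tested:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw JSON files; groupFiles/groupSize tell Glue to batch small
# input files together instead of creating one partition per file.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/mnt/data/"],   # placeholder bucket/prefix
        "groupFiles": "inPartition",
        "groupSize": "134217728",                # ~128 MB per input group
    },
    format="json",
)

# Write the same data back out as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet/"},  # placeholder output prefix
    format="parquet",
)

job.commit()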
zipfile solution: https://docs.python.org/3/library/zipfile.html
AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
Really good blog posts explaining the small file problem: