Tags: apache-spark, amazon-s3, pyspark, apache-spark-sql, databricks

Reading Millions of Small JSON Files from S3 Bucket in PySpark Very Slow


I have a folder (path = mnt/data/*.json) in S3 with millions of JSON files (each file is less than 10 KB). I run the following code:

df = (spark.read
           .option("multiline", True)
           .option("inferSchema", False)
           .json(path))
display(df)

The problem is that it is very slow. Spark creates a single job for this with one task. The task appears to have no more executors running it, which usually signifies the completion of a job (right?), but for some reason the command cell in Databricks is still running. It has been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.

[screenshot of the Spark job]

Notes to consider:

  • Since there are millions of JSON files, I can't say with confidence that they all have the exact same structure (there could be some discrepancies)
  • The JSON files were web-scraped from the same REST API
  • I read somewhere that setting inferSchema to False can help reduce runtime, which is why I used it (see the explicit-schema sketch after these notes)
  • The AWS S3 bucket is already mounted, so there is absolutely no need to use boto3
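For reference, here is a minimal sketch of passing an explicit schema to the JSON reader: for JSON sources, Spark infers the schema by scanning the files unless one is supplied, so providing a schema up front avoids that extra pass. The field names below are hypothetical placeholders; the real ones depend on the REST API the files were scraped from.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical fields -- replace with the actual structure returned by the API
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("payload", StringType(), True),
])

df = (spark.read
           .schema(schema)            # skip schema inference over millions of files
           .option("multiline", True)
           .json(path))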

Solution

  • My approach was very simple, thanks to Anand pointing out the "small file problem." My problem was that I could not efficiently extract ~2 million JSON files, each ~10 KB in size, so there was no way to read them and then store them in Parquet format as an intermediate step. I was given an S3 bucket with raw JSON files scraped from the web.

    At any rate, Python's zipfile module came in handy. I used it to combine multiple JSON files into archives that were each at least 128 MB and at most 1 GB (see the sketch at the end of this answer). It worked pretty well!

    There is also another way to do this using AWS Glue, but of course that requires IAM role authorization and can be expensive; the advantage of that route is that you can convert those files into Parquet directly.

    zipfile solution: https://docs.python.org/3/library/zipfile.html

    AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f

    Really good blog posts explaining the small file problem:

    https://mungingdata.com/apache-spark/compacting-files/

    https://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/?unapproved=252&moderation-hash=5a657350c6169448d65209caa52d5d2c#comment-252
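    A minimal sketch of how the zipfile compaction step might look, assuming the mounted bucket is reachable through a local path. The /dbfs paths, file names, and the 128 MB threshold below are placeholders, not the exact script that was used:

    import os
    import zipfile

    SRC_DIR = "/dbfs/mnt/data"          # hypothetical path to the mounted raw JSON files
    OUT_DIR = "/dbfs/mnt/data_zipped"   # hypothetical output location for the combined archives
    TARGET_SIZE = 128 * 1024 * 1024     # roll over to a new archive once ~128 MB is reached

    os.makedirs(OUT_DIR, exist_ok=True)

    archive_idx = 0
    current_size = 0
    current_zip = zipfile.ZipFile(os.path.join(OUT_DIR, f"part-{archive_idx}.zip"), "w")

    for name in os.listdir(SRC_DIR):
        if not name.endswith(".json"):
            continue
        file_path = os.path.join(SRC_DIR, name)
        current_zip.write(file_path, arcname=name)   # append the small file to the current archive
        current_size += os.path.getsize(file_path)
        if current_size >= TARGET_SIZE:              # start a new archive so each stays well under 1 GB
            current_zip.close()
            archive_idx += 1
            current_size = 0
            current_zip = zipfile.ZipFile(os.path.join(OUT_DIR, f"part-{archive_idx}.zip"), "w")

    current_zip.close()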