Tags: linux, amazon-web-services, apache-spark, pyspark, aws-glue

Dealing with a large number of small JSON files using PySpark


I have around 376K JSON files under a directory in S3. Each file is about 2.5 KB and contains only a single record. When I tried to load the entire directory through a Glue ETL job with 20 workers using the code below:

spark.read.json("path")  

It simply didn't run; the job timed out after 5 hours. So I wrote and ran a shell script to merge the records from these files into a single file, but when I tried to load that file, it displayed only a single record. The merged file is 980 MB. When I tested locally by merging just 4 records into one file, it worked fine and displayed 4 records as expected.

I used the command below to append the JSON records from the different files into a single file:

for f in Agent/*.txt; do cat "${f}" >> merged.json; done

The JSON isn't nested. I also tried the multiline option, but it didn't work. So what can be done in this case? My guess is that after merging, the records aren't being treated as separate records, which is causing the issue. I even tried head -n 10 to display the first 10 lines, but it seemed to hang indefinitely.
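
For context, spark.read.json without the multiline option expects JSON Lines input, i.e. one complete JSON object per line. A minimal local sketch of that format (hypothetical file name and values, assuming an active local SparkSession) looks like this:

import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonlines-check").getOrCreate()

# Write two records in JSON Lines form: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w") as f:
    f.write('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')

spark.read.json(path).show()  # two rows, as expected

A merged file whose records are not separated by newlines does not match this format, which is consistent with the single-record behaviour described above.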


Solution

  • The problem was with the shell script I was using to merge the small files. After the merge, the records weren't aligned properly, so they weren't treated as separate records.

    Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell command that merges a large number of records into a single file much faster:

    find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt  
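
    Here jq's -s (slurp) flag reads every JSON value from the concatenated input and wraps them all in a single JSON array. For anyone without jq, a rough Python stand-in (illustrative only, assuming each .txt file holds exactly one JSON object, as in the question) could be:

    import json
    from pathlib import Path

    # Parse each per-record file and write all records out as one JSON array,
    # roughly mirroring what the find | jq -s pipeline produces.
    records = [
        json.loads(p.read_text())
        for p in sorted(Path(".").rglob("*.txt"))
        if p.name != "output.txt"  # skip our own output file
    ]
    Path("output.txt").write_text(json.dumps(records, indent=2))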
    

    Since jq -s wraps all the records in a single JSON array that spans multiple lines, I was then able to load the records as expected with the code below:

    spark.read.option("multiline","true").json("path")
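
    As a quick sanity check (same hypothetical "path" as above), the row count of the loaded DataFrame should match the number of source files, roughly 376K here:

    df = spark.read.option("multiline", "true").json("path")
    print(df.count())  # expect ~376K rows, one per original source file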