Tags: json, data-science, jq, data-analysis, data-cleaning

Pandas Dataframe to JSON: returns a single line for 1 million records


I need to do some processing on my JSON data, but it turns out that the JSON is formatted so that the whole file is a single line. In the terminal, wc -l file.json returns 0, which means the file contains no newline characters at all.

The file was created by converting a pandas DataFrame to JSON.

Here is a sample of file.json:

[
{"id":683156,"overall_rating":5.0,"hotel_id":220216,"hotel_name":"Beacon Hill Hotel","title":"\u201cgreat hotel, great location\u201d","text":"The rooms here are not palatial","author_id":"C0F"},
{"id":692745,"overall_rating":5.0,"hotel_id":113317,"hotel_name":"Casablanca Hotel Times Square","title":"\u201cabsolutely delightful\u201d","text":"I travelled from Spain...","author_id":"8C1"}
]

Solution

  • I want to split it into, say, 10,000 records per file.

    You could use jq to emit the top-level items in the array, one per line, as follows:

    jq -c '.[]' file.json
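
    For readers without jq, roughly the same transformation can be sketched in Python with only the standard library. This is a minimal sketch, not the answerer's method: the function name array_to_json_lines and the output name file.jsonl are hypothetical, and it assumes the whole array fits in memory.

    ```python
    import json


    def array_to_json_lines(src_path, dst_path):
        """Rewrite a top-level JSON array as one compact JSON object per line.

        Hypothetical helper; assumes the file holds a single JSON array
        that is small enough to load into memory at once.
        """
        with open(src_path, encoding="utf-8") as src:
            records = json.load(src)

        with open(dst_path, "w", encoding="utf-8") as dst:
            for record in records:
                # Compact separators give one dense line per record,
                # similar in spirit to jq's -c option.
                line = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
                dst.write(line + "\n")

        return len(records)


    if __name__ == "__main__":
        count = array_to_json_lines("file.json", "file.jsonl")
        print(f"wrote {count} records")
    ```

    Because every record now ends with a newline, wc -l on the output file should report the number of records instead of 0.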
    

    If you simply want to partition this stream (without reconstituting each partition as an array), you can pipe it to a tool such as split, for example split -l 10000 to get 10,000 lines per output file.

    If you want each partition to be an array, you could use jq to form the partitions, and then use a tool such as awk to create the separate files. See, for example, the Stack Overflow Q&A "Splitting / chunking JSON files with JQ in Bash or Fish shell?"
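
    As an alternative to the jq and awk route, the array-per-file variant can be sketched in plain Python. This is a hedged sketch under stated assumptions: the function name split_json_array, the output directory parts, and the part_00000.json naming scheme are all hypothetical, and the input is assumed to fit in memory.

    ```python
    import json
    from pathlib import Path


    def split_json_array(src_path, out_dir, chunk_size=10_000):
        """Split a top-level JSON array into files of at most chunk_size records.

        Each output file is itself a valid JSON array. Hypothetical helper;
        loads the entire input into memory before splitting.
        """
        records = json.loads(Path(src_path).read_text(encoding="utf-8"))

        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)

        paths = []
        for start in range(0, len(records), chunk_size):
            chunk = records[start:start + chunk_size]
            # Zero-padded index keeps the files in order when listed.
            path = out_dir / f"part_{start // chunk_size:05d}.json"
            path.write_text(json.dumps(chunk, ensure_ascii=False), encoding="utf-8")
            paths.append(path)

        return paths


    if __name__ == "__main__":
        for p in split_json_array("file.json", "parts"):
            print(p)
    ```

    With 1 million records and the default chunk_size, this would produce 100 files. For inputs too large to load at once, a streaming approach such as the jq pipeline above is the safer choice.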