
How to read independent indented JSON objects from a file (JSON Lines but indented)?


I have a massive file of JSON objects stored one after another. It was supposed to be in JSON Lines format, but I made the big mistake of writing the objects indented, so each object spans multiple lines instead of one.

It has this format (note that there is no comma between objects):

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Mon Sep 21 11:51:09 +0000 2020",
    "entities": {
        "hashtags": [],
        "symbols": []
    }
}
{
    "contributors": null,
    "coordinates": null,
    "created_at": "Mon Sep 21 11:51:09 +0000 2020",
    "entities": {
        "hashtags": [],
        "symbols": []
    }
}

The whole project is built to work with non-indented files (one JSON object per line), so my idea was to convert this big file to the non-indented format, but I'm struggling to find a way to read it. The code I tried for the conversion is:

import json
import sys
import os

FILE_INPUT='PathToTheBigFile'
FILE_OUTPUT='PathToConvertedFile'

tweets_list = []

for line in open(FILE_INPUT, 'r', encoding='utf-8'):
    tweets_list.append(json.loads(line))

with open(FILE_OUTPUT, 'a') as outfile:
    for tweet in tweets_list:
        outfile.write(json.dumps(tweet) + '\n')

It works fine with non-indented files (it basically copies the file), but with the indented file it raises this JSONDecodeError:

json.decoder.JSONDecodeError: 
    Expecting property name enclosed in double quotes: line 2 column 1 (char 2)

I have tried to do it in Python and also thought about doing it with Linux and tr commands or something like that, but I haven't found a way. I may try with other languages.

Any suggestions on how to do this?


Solution

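  • The problem is that you have several objects next to each other in your JSON string/file with nothing separating them. Reading it line by line hands json.loads only a fragment of an object (the first line is just {), which is why you get that error. If you could add a comma between the objects (and wrap the whole thing in []), then you could parse it as an array of objects.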

    Try something like this when reading in your file:

    import re
    
    with open(FILE_INPUT, 'r', encoding='utf-8') as file:
        json_data = re.sub(r"}\s*{", "},{", file.read())
        tweets_list.extend(json.loads("[" + json_data + "]"))
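
One caveat with the regex: it also rewrites any "}{" sequence that happens to sit inside a string value (tweet text, for instance). A minimal sketch of an alternative that avoids this uses the standard library's json.JSONDecoder.raw_decode to pull one object at a time out of the text. The function name iter_json_objects and the sample string are made up for this illustration, and the sketch still assumes the whole file fits in memory:

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not accept leading whitespace, so skip it by hand
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index where it stopped
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: two indented objects, one with "} {" inside a string
sample = '{\n    "a": 1,\n    "text": "} {"\n}\n{\n    "a": 2\n}\n'
lines = [json.dumps(obj) for obj in iter_json_objects(sample)]
print("\n".join(lines))
```

To convert the real file, you would pass the result of file.read() to iter_json_objects and write json.dumps(tweet) + '\n' for each object it yields, which gives one object per line.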
    

    Then, when writing your file, you should save it as an array of objects instead of one object per line. There's no reason to call json.dumps more than once.

    with open(FILE_OUTPUT, 'w') as outfile:
        outfile.write(json.dumps(tweets_list))
    

    Note I'm using 'w', so it overwrites the file. To add data later, you'll have to read in the whole file first, append to the array, and write the whole file back out.
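
That read, append, rewrite cycle could look roughly like this. It is only a sketch: append_to_json_array is a hypothetical helper name, and treating a missing or empty file as an empty array is an assumption, not something from the question.

```python
import json
import os


def append_to_json_array(path, new_items):
    """Append items to a file holding a single JSON array by rewriting the file."""
    items = []
    # Assumption of this sketch: a missing or empty file means an empty array
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, 'r', encoding='utf-8') as infile:
            items = json.load(infile)
    items.extend(new_items)
    # 'w' truncates the file, so the full array is written back every time
    with open(path, 'w', encoding='utf-8') as outfile:
        json.dump(items, outfile)
    return len(items)
```

The cost of each append grows with the size of the file, which is the trade-off of storing everything as one array.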


    If you are gonna be appending data to a file and reading that file back in again, I'd suggest trying something like CSV instead of JSON. You can easily append lines to a CSV file without worrying about parsing it back in later. Or maybe even an XML file could work here too.
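
For the CSV idea, a rough sketch with the standard csv module follows. The helper name append_rows and the choice of columns in FIELDNAMES are hypothetical; nested fields such as "entities" do not map onto flat columns, so this sketch simply drops them.

```python
import csv
import os

# Hypothetical column choice: only the flat fields from the sample objects
FIELDNAMES = ['created_at', 'contributors', 'coordinates']


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only for a new file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, 'a', newline='', encoding='utf-8') as outfile:
        # extrasaction='ignore' skips keys that are not listed in FIELDNAMES
        writer = csv.DictWriter(outfile, fieldnames=FIELDNAMES, extrasaction='ignore')
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Each call only adds lines at the end of the file, so nothing already written has to be parsed or rewritten. Note that None values come back as empty strings when the file is read again.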