Python conversion from JSON to JSONL


I want to convert a standard JSON array into a format where each line contains a separate, self-contained, valid JSON object. See JSON Lines.

JSON_file =

[{u'index': 1,
  u'no': 'A',
  u'met': u'1043205'},
 {u'index': 2,
  u'no': 'B',
  u'met': u'000031043206'},
 {u'index': 3,
  u'no': 'C',
  u'met': u'0031043207'}]

To JSONL:

{u'index': 1, u'no': 'A', u'met': u'1043205'}
{u'index': 2, u'no': 'B', u'met': u'000031043206'}
{u'index': 3, u'no': 'C', u'met': u'0031043207'}

My current solution is to read the JSON file as a text file and remove the [ from the beginning and the ] from the end, which leaves a valid JSON object on each line rather than a nested object containing lines.

I wonder if there is a more elegant solution? I suspect something could go wrong using string manipulation on the file.

The motivation is to read JSON files into an RDD on Spark. See the related question - Reading JSON with Apache Spark - `corrupt_record`


Solution

  • Your input appears to be a sequence of Python objects; it certainly is not a valid JSON document.

    If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

    import json
    
    with open('output.jsonl', 'w') as outfile:
        for entry in JSON_file:
            json.dump(entry, outfile)
            outfile.write('\n')
    

    The default configuration for the json module is to output JSON without newlines embedded.

    Assuming your A, B and C names are really strings, that would produce (the key order may differ depending on your Python version):

    {"index": 1, "met": "1043205", "no": "A"}
    {"index": 2, "met": "000031043206", "no": "B"}
    {"index": 3, "met": "0031043207", "no": "C"}
    

    If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().
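
For the case where the input really is a JSON file containing a list, the two steps (parse, then dump each entry on its own line) could be combined roughly as follows. This is a minimal sketch, not a definitive implementation: the function name json_array_to_jsonl and the file names in the usage note are hypothetical, and it assumes the whole document fits in memory and that its top-level value is an array.

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Convert a JSON file holding a top-level array into JSON Lines.

    Hypothetical helper for illustration; returns the number of records written.
    """
    # Parse the whole document at once (assumes it fits in memory).
    with open(src_path, encoding="utf-8") as infile:
        entries = json.load(infile)

    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON array")

    with open(dst_path, "w", encoding="utf-8") as outfile:
        for entry in entries:
            # json.dumps() escapes newlines inside strings, so with the
            # default settings each record stays on a single line.
            outfile.write(json.dumps(entry) + "\n")

    return len(entries)
```

Calling json_array_to_jsonl('input.json', 'output.jsonl') would then write one object per line. Unlike stripping the [ and ] by hand, this does not depend on how the original file happens to be formatted.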