python json data-analysis

How to analyze JSON objects that are NOT separated by commas (preferably in Python)


So I've been trying to analyze data that is supposedly in JSON format, but the objects are not separated by commas. Here is a sample from my data:

{
  "areaId": "Tracking001",
  "areaName": "Learning Theater Indoor",
  "color": "#99FFFF"
}
{
  "areaId": "Tracking001",
  "areaName": "Learning Theater Indoor",
  "color": "#33CC00"
}

There are thousands of them, so separating them manually is not feasible. So here is my question: do I have to separate them with commas, put them under an overarching key, and make everything else the value in order to analyze the data? I'm a beginner at data analysis, especially with JSON-formatted data, so any tips would be appreciated.


Solution

  • The raw_decode(s) method of json.JSONDecoder sounds like what you need. To quote from its docstring:

    raw_decode(s): Decode a JSON document from s (a str beginning with a JSON document) and return a 2-tuple of the Python representation and the index in s where the document ended. This can be used to decode a JSON document from a string that may have extraneous data at the end.

    Example usage:

    import json
    
    s = """{
      "areaId": "Tracking001",
      "areaName": "Learning Theater Indoor",
      "color": "#99FFFF"
    }
    {
      "areaId": "Tracking001",
      "areaName": "Learning Theater Indoor",
      "color": "#33CC00"
    }"""
    decoder = json.JSONDecoder()
    v0, i = decoder.raw_decode(s)
    v1, _ = decoder.raw_decode(s[i:].lstrip())  # raw_decode rejects leading whitespace, so strip the line break
    

    Now v0 and v1 hold the parsed JSON values.

    You may want to use a loop if you have thousands of values:

    import json
    
    with open("some_file.txt", "r") as f:
        content = f.read().strip()  # raw_decode rejects leading whitespace
    parsed_values = []
    decoder = json.JSONDecoder()
    while content:
        value, new_start = decoder.raw_decode(content)
        content = content[new_start:].strip()
        # You can handle the value directly in this loop:
        print("Parsed:", value)
        # Or you can store it in a container and use it later:
        parsed_values.append(value)
    

    Parsing 1000 of the JSON values above with this code took about 0.03 seconds on my computer. However, it becomes inefficient for larger files, because it reads the complete file into memory and copies the remaining string on every iteration.
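
The loop above builds a new string from the remaining content on every iteration. One possible way to avoid that copying is to keep the string as it is and move a position through it instead. In CPython, raw_decode also accepts a start index as a second argument (this is not mentioned in the docstring quoted above, so treat it as an assumption about your Python version). The following is only a sketch: iter_json_values is a made-up helper name, and it still assumes the whole file fits in memory.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value in text, one after another.

    iter_json_values is a hypothetical helper name, not part of the json module.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Assumption: raw_decode(s, idx) starts decoding at idx and returns
        # the index (relative to the full string) where the value ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


sample = """{
  "areaId": "Tracking001",
  "areaName": "Learning Theater Indoor",
  "color": "#99FFFF"
}
{
  "areaId": "Tracking001",
  "areaName": "Learning Theater Indoor",
  "color": "#33CC00"
}
"""

values = list(iter_json_values(sample))
for value in values:
    print(value["areaId"], value["color"])
```

Because this is a generator, you can also process each value as it is decoded instead of collecting everything in a list first.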