Inconsistency of character indexes while trying to parse multiple JSON in a file


I am using the following code to parse multiline JSON objects, separated by commas, from a web-scraped string stored in a .json file:

import json

def stream_read_json(fn):
    start_pos = 0
    with open(fn, 'r', encoding='utf-8') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str, encoding = 'utf-8')
                start_pos += e.pos
                yield obj

The first object is parsed correctly; the following ones are not. While testing various values for f.seek(start_pos), I noticed that they are inconsistent with the index reported by the exception (e.pos in except json.JSONDecodeError as e:). Why does this index differ from the character count my IDE shows when I select the text from the start of the file up to the character where the JSON object ends?

How can I ensure the objects will be parsed correctly?

I tried to find the right f.seek(start_pos) value for the second JSON object at the debug prompt, but it differs greatly from the e.pos reported by the error.

A sample JSON is here:

{
  "user": {
    "id": 1,
    "profile": {
      "name": "Alice",
      "age": 30
    }
  },
  "product": {
    "sku": "A1234",
    "details": {
      "name": "Laptop",
      "price": 999.99
    }
  }
},
{
  "user": {
    "id": 2,
    "profile": {
      "name": "Bob",
      "age": 22
    }
  },
  "product": {
    "sku": "A123w",
    "details": {
      "name": "Laptop",
      "price": 9.99
    }
  }
}

Solution

  • This is definitely not how it should be done, but I'll suggest a workaround for your particular situation.

    json.load(f) raises JSONDecodeError: Extra data: line

    The problem is that your file is not valid JSON as a whole: it holds several objects separated by commas, with no enclosing brackets [] to make them a list. As a workaround, you can add the brackets yourself before parsing:

    import json
    
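    # Read the whole file, using the same encoding as in the question.
    with open("test.json", "r", encoding="utf-8") as file:
        str_data: str = file.read()
    
    # Wrap the comma-separated objects in brackets so they form one JSON array.
    data: list[dict] = json.loads(f"[{str_data}]")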
    
    for item in data:
        ...
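
On the index mismatch in the question: e.pos is a character index into the string that the decoder was given, while f.seek() on a file opened in text mode expects an opaque offset previously returned by f.tell(). The two need not agree, for example when the file contains multi-byte UTF-8 characters or translated line endings. One way to avoid seeking altogether is to read the text once and walk through it with json.JSONDecoder.raw_decode, which returns both the parsed value and the index where it ended. The following is only a sketch of that idea; the helper name iter_json_objects is made up for illustration, and it assumes the objects are separated only by commas and whitespace.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in text.

    Hypothetical helper: assumes values are separated only by
    commas and/or whitespace, as in the sample from the question.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace or separators,
        # so move past them manually.
        while pos < end and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= end:
            return
        # raw_decode returns the parsed value and the character index
        # in text where that value ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Small inline sample shaped like the data in the question.
sample = '{"user": {"id": 1}},\n{"user": {"id": 2}}'
ids = [obj["user"]["id"] for obj in iter_json_objects(sample)]
```

For a file, the same helper could be called on the result of file.read(). All positions stay character indexes into a single string, so there is no mixing of character counts and file offsets.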