Python ijson - parse error: trailing garbage // bz2.decompress()

I have come across an error while parsing json with ijson.

Background: I have a series (approximately 1,000) of large files of Twitter data compressed in '.bz2' format. I need to extract elements from each file into a pd.DataFrame for further analysis, and I have already identified the keys I need. I am cautious about posting the Twitter data itself.

Attempt: I have managed to decompress the files using bz2.decompress with the following code:

## Code in loop specific for decompressing and parsing - 

with open(file, 'rb') as source:
    # Decompress the file
    json_r = bz2.decompress(source.read())
    json_decom = json_r.decode('utf-8')  # decompresses one file at a time rather than a stream

    # Parse the JSON with ijson
    parser = ijson.parse(json_decom)
    for prefix, event, value in parser:
        # Print selected items as part of testing
        if prefix == "created_at":
            print(value)
        if prefix == "text":
            print(value)
        if prefix == "user.id_str":
            print(value)

This gives the following error:

IncompleteJSONError: parse error: trailing garbage
          estamp_ms":"1609466366680"}  {"created_at":"Fri Jan 01 01:59
                     (right here) ------^

Two things:

1. Is my decompression method correct, and does it give the right type of input for ijson to parse (ijson takes both bytes and str)?
2. Is this a JSON error? If it is, is it possible to write some kind of error handler that moves on to the next file? Any suggestions would be appreciated.

Any assistance would be greatly appreciated.

Thank you, James

Solution

To directly answer your two questions:

1. The decompression method is correct in the sense that it yields JSON data that you then feed to ijson. As you point out, ijson accepts both str and bytes inputs (although the latter is preferred). If you were giving ijson non-JSON input, the error message would not show JSON data in it.
2. This is a very common error that is described in ijson's FAQ. It means your input has more than one top-level value, which is not standard JSON but is supported by ijson through the multiple_values option (see the docs for details).

About the code as a whole: while the approach is correct, it could be improved. The whole point of using ijson is to avoid loading the full JSON contents into memory, and the code you posted doesn't take advantage of that. It opens the bz2-compressed file, reads it as a whole, decompresses it as a whole, (unnecessarily) decodes it as a whole, and only then gives the decoded data to ijson. If both the input file and the decompressed data are small you won't see any impact, but if your files are big you'll definitely start noticing it.

A better approach is to stream the data through all the operations so that decompression and JSON parsing both happen incrementally, with no separate decoding step. Something along the lines of:

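import bz2
import ijson

with bz2.BZ2File(filename, mode='r') as f:
    for prefix, event, value in ijson.parse(f, multiple_values=True):
        ...  # handle each event; multiple_values allows several top-level values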

As the cherry on the cake, if you want to build a DataFrame from that you can use DataFrame's data argument to build the DataFrame directly with the results from the above. data can be an iterable, so you can, for example, make the code above a generator and use it as data. Again, something along the lines of:

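import bz2
import ijson
import pandas

def json_input(filename):
    with bz2.BZ2File(filename, mode='r') as f:
        for prefix, event, value in ijson.parse(f, multiple_values=True):
            yield (prefix, event, value)  # placeholder: yield your own rows instead

df = pandas.DataFrame(data=json_input(filename))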