
How to efficiently parse a large gzipped JSON file with ijson without encountering "trailing garbage" errors?


I am working with a large gzipped JSON file containing review data, with one JSON object per line. My goal is to efficiently extract the review_text field from each object using ijson, without loading the entire file into memory, as the file contains over 15 million records.

However, when trying to parse the file using ijson, I encounter the following error:

IncompleteJSONError: parse error: trailing garbage
          votes": 16, "n_comments": 0} {"user_id": "8842281e1d1347389f
                     (right here) ------^
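
For reference, the same error can be reproduced with a minimal example (a sketch; with the empty prefix '' ijson.items yields each top-level JSON value, and by default it expects exactly one):

import io
import ijson

# The parser consumes the first document, then raises
# IncompleteJSONError ("trailing garbage") at the second '{'.
list(ijson.items(io.StringIO('{"a": 1} {"a": 2}'), ''))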

Here is an actual sample of the data (the first two records):

['{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "24375664", "review_id": "5cd416f3efc3f944fce4ce2db2290d5e", "rating": 5, "review_text": "Mind blowingly cool. Best science fiction I\'ve read in some time. I just loved all the descriptions of the society of the future - how they lived in trees, the notion of owning property or even getting married was gone. How every surface was a screen. \\n The undulations of how society responds to the Trisolaran threat seem surprising to me. Maybe its more the Chinese perspective, but I wouldn\'t have thought the ETO would exist in book 1, and I wouldn\'t have thought people would get so over-confident in our primitive fleet\'s chances given you have to think that with superior science they would have weapons - and defenses - that would just be as rifles to arrows once were. \\n But the moment when Luo Ji won as a wallfacer was just too cool. I may have actually done a fist pump. Though by the way, if the Dark Forest theory is right - and I see no reason why it wouldn\'t be - we as a society should probably stop broadcasting so much signal out into the universe.", "date_added": "Fri Aug 25 13:55:02 -0700 2017", "date_updated": "Mon Oct 09 08:55:59 -0700 2017", "read_at": "Sat Oct 07 00:00:00 -0700 2017", "started_at": "Sat Aug 26 00:00:00 -0700 2017", "n_votes": 16, "n_comments": 0}\n', '{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "18245960", "review_id": "dfdbb7b0eb5a7e4c26d59a937e2e5feb", "rating": 5, "review_text": "This is a special book. It started slow for about the first third, then in the middle third it started to get interesting, then the last third blew my mind. This is what I love about good science fiction - it pushes your thinking about where things can go. \\n It is a 2015 Hugo winner, and translated from its original Chinese, which made it interesting in just a different way from most things I\'ve read. For instance the intermixing of Chinese revolutionary history - how they kept accusing people of being \\"reactionaries\\", etc. \\n It is a book about science, and aliens. The science described in the book is impressive - its a book grounded in physics and pretty accurate as far as I could tell. Though when it got to folding protons into 8 dimensions I think he was just making stuff up - interesting to think about though. \\n But what would happen if our SETI stations received a message - if we found someone was out there - and the person monitoring and answering the signal on our side was disillusioned? That part of the book was a bit dark - I would like to think human reaction to discovering alien civilization that is hostile would be more like Enders Game where we would band together. \\n I did like how the book unveiled the Trisolaran culture through the game. It was a smart way to build empathy with them and also understand what they\'ve gone through across so many centuries. And who know a 3 body problem was an unsolvable math problem? But I still don\'t get who made the game - maybe that will come in the next book. \\n I loved this quote: \\n \\"In the long history of scientific progress, how many protons have been smashed apart in accelerators by physicists? How many neutrons and electrons? Probably no fewer than a hundred million. Every collision was probably the end of the civilizations and intelligences in a microcosmos. In fact, even in nature, the destruction of universes must be happening at every second--for example, through the decay of neutrons. 
Also, a high-energy cosmic ray entering the atmosphere may destroy thousands of such miniature universes....\\"", "date_added": "Sun Jul 30 07:44:10 -0700 2017", "date_updated": "Wed Aug 30 00:00:26 -0700 2017", "read_at": "Sat Aug 26 12:05:52 -0700 2017", "started_at": "Tue Aug 15 13:23:18 -0700 2017", "n_votes": 28, "n_comments": 1}\n']

Here is the original dataset, under "book reviews": https://mengtingwan.github.io/data/goodreads.html#datasets
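
You can verify the one-object-per-line layout yourself with a quick peek at the first couple of lines of the archive (a sketch, using the same file path as in my code below):

import gzip
from itertools import islice

with gzip.open('goodreads_dataset.json.gz', 'rt', encoding='utf-8') as f:
    # Print the beginning of the first two records; each should start
    # with '{' and end with '}' on its own line.
    for line in islice(f, 2):
        print(line[:100])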

So far I have tried:

  • Using ijson to parse the file directly, but it leads to "trailing garbage" errors.
  • Cleaning up each line before parsing it as JSON, which partially works but isn't efficient for large files.

How can I efficiently parse the large gzipped JSON file using ijson or any other approach that avoids loading the entire file into memory and does not result in the "trailing garbage" error? What adjustments can I make to handle this file format correctly?

Here is my current attempt, which produces the error above:

import ijson
import pandas as pd
import gzip

review_texts = []

gzip_file_path = 'goodreads_dataset.json.gz'

with gzip.open(gzip_file_path, 'rt', encoding='utf-8') as f:
    objects = ijson.items(f, 'item')  # Use 'item' if it's a top-level array

    for obj in objects:
        if 'review_text' in obj:
            review_texts.append(obj['review_text'])

df = pd.DataFrame(review_texts, columns=['review_text'])
df.to_pickle('reviews.pkl')

print(f"Saved {len(df)} review_text entries to 'reviews.pkl')

Solution

  • The data file contains one JSON document per line, so its format is actually JSON Lines; a more accurate archive extension would be .jsonl.gz.

    You can simply read the file line by line and use the regular json module to parse each line:

    import gzip
    import json

    review_texts = []

    gzip_file_path = 'goodreads_dataset.json.gz'

    with gzip.open(gzip_file_path, 'rt', encoding='utf-8') as f:
        for line in f:
            # Each line is one complete JSON document.
            obj = json.loads(line)
            if 'review_text' in obj:  # skip records missing the field
                review_texts.append(obj['review_text'])
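
    To finish the original task, you can then build the DataFrame and pickle it exactly as in the question:

    import pandas as pd

    df = pd.DataFrame(review_texts, columns=['review_text'])
    df.to_pickle('reviews.pkl')
    print(f"Saved {len(df)} review_text entries to 'reviews.pkl'")

  • Alternatively, if you would still like to stream with ijson, it can parse a stream of concatenated JSON documents when you pass multiple_values=True (a sketch; with the empty prefix '' each top-level document is yielded whole):

    import gzip
    import ijson

    review_texts = []

    with gzip.open('goodreads_dataset.json.gz', 'rt', encoding='utf-8') as f:
        # multiple_values=True tells ijson to accept back-to-back JSON
        # documents instead of a single top-level value, which is exactly
        # the situation that triggered the "trailing garbage" error.
        for obj in ijson.items(f, '', multiple_values=True):
            if 'review_text' in obj:
                review_texts.append(obj['review_text'])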