
When extracting my .json.gz file, some characters are added to it - and the file cannot be stored as a json file


I am trying to unzip some .json.gz files, but gzip adds some characters to it, and hence makes it unreadable for JSON.

What do you think is the problem, and how can I solve it?

If I use unzipping software such as 7zip to unzip the file, this problem disappears.

This is my code:

import gzip
import json
with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())

This is the error I get:

Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)

I used this code:

with gzip.open('filename', mode='rb') as f:
    print(f.read())

and realized that the file starts with b' (as shown below):

b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"

I think b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of these zipped files, and I cannot do that manually.

I uploaded a sample of these files at the following link: just a few json.gz files


Solution

  • The problem isn't the b prefix you're seeing with print(f.read()). That prefix just means the data is a bytes object (a sequence of integer byte values) rather than a str (a sequence of Unicode characters), and json.loads() accepts either. The JSONDecodeError occurs because the gzipped file as a whole is not a single valid JSON document, which is what json.loads() requires: "Extra data" means the parser found more content after the end of the first complete JSON value. The format looks like JSON Lines, which the Python standard library json module doesn't (directly) support.
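
To illustrate the point, here is a minimal sketch using a made-up two-record payload (the payload variable and its contents are assumptions for demonstration, not data from your files). It shows that json.loads() parses bytes without complaint when they hold one JSON value, and raises the same kind of error when two values follow each other:

```python
import json

# Hypothetical two-record payload in JSON Lines form (not from the sample file).
payload = b'{"id": 1}\n{"id": 2}\n'

# One JSON object parses fine from bytes, so the b prefix is not the problem.
first_line = payload.splitlines()[0]
single = json.loads(first_line)

# The whole payload is two JSON values back to back, which json.loads() rejects.
try:
    json.loads(payload)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg

print(single)
print(error_message)
```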

    Dunes' answer to the question that @Charles Duffy at one point marked this one as a duplicate of wouldn't have worked as presented, because of this formatting issue. However, from the sample file you linked to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.

    Here's what I mean:

    import json
    import gzip
    
    
    filename = '00_activities.json.gz'  # Sample file.
    
    json_content = []
    with gzip.open(filename, 'rb') as gzip_file:
        for line in gzip_file:  # Read one line.
            line = line.rstrip()
            if line:  # Any JSON data on it?
                obj = json.loads(line)
                json_content.append(obj)
    
    print(json.dumps(json_content, indent=4))  # Pretty-print the parsed data.

    Note that the printed output is a single JSON array containing every object, which shows what a valid JSON version of the file might have looked like.
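
Since you mention having millions of files, accumulating everything in one list may not be practical. A possible variation, sketched here under the assumption that the files are UTF-8 encoded and all match *.json.gz in one directory, is to yield objects one at a time. The names iter_json_lines and iter_directory are hypothetical helpers, not part of any library:

```python
import gzip
import json
from pathlib import Path


def iter_json_lines(path):
    """Yield one parsed object per non-empty line of a gzipped JSON Lines file."""
    # 'rt' makes gzip decode bytes to text; UTF-8 is an assumption about the files.
    with gzip.open(path, 'rt', encoding='utf-8') as gzip_file:
        for line in gzip_file:
            line = line.strip()
            if line:  # Skip blank lines.
                yield json.loads(line)


def iter_directory(directory, pattern='*.json.gz'):
    """Yield (path, object) pairs for every matching file in a directory."""
    for path in sorted(Path(directory).glob(pattern)):
        for obj in iter_json_lines(path):
            yield path, obj
```

For example, a loop such as for path, obj in iter_directory('data') would let you handle each record as it is read, where 'data' is a placeholder directory name.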