I am trying to unzip some .json.gz
files, but gzip
adds some characters to it, and hence makes it unreadable for JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
with gzip.open('filename' , 'rb') as f:
json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open ('filename', mode='rb') as f:
print(f.read())
and realized that the file starts with b'
(as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think b'
is what makes the file unworkable for the next stage. Do you have any solution to remove the b'
? There are millions of this zipped file, and I cannot manually do that.
I uploaded a sample of these files in the following link just a few json.gz files
The problem isn't with that b
prefix you're seeing with print(f.read())
, which just means the data is a bytes
sequence (i.e. integer ASCII values) not a sequence of UTF-8 characters (i.e. a regular Python string) — json.loads()
will accept either. The JSONDecodeError
is because the data in the gzipped file isn't in valid JSON format, which is required. The format looks like something known as JSON Lines — which the Python standard library json
module doesn't (directly) support.
Dunes' answer to the question @Charles Duffy marked this—at one point—as a duplicate of wouldn't have worked as presented because of this formatting issue. However from the sample file you added a link to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line-by-line.
Here's what I mean:
import json
import gzip
filename = '00_activities.json.gz' # Sample file.
json_content = []
with gzip.open(filename , 'rb') as gzip_file:
for line in gzip_file: # Read one line.
line = line.rstrip()
if line: # Any JSON data on it?
obj = json.loads(line)
json_content.append(obj)
print(json.dumps(json_content, indent=4)) # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.