I have several sub-folders, each containing gzipped Twitter files. I want Python to iterate through these sub-folders and turn the files into regular JSON files. There are more than 300 sub-folders, each containing about 1000 or more of these compressed files. A sample file name: 00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D
Thanks in advance
I tried the three snippets below, just to see whether I could extract one of those files, but none of them worked.
import zipfile
zip_ref = zipfile.ZipFile('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0', 'r')
zip_ref.extractall('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
zip_ref.close()
import tarfile
tar = tarfile.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
tar.extractall()
tar.close()
import gzip
import json
with gzip.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D', 'rb') as f:
    d = json.loads(f.read().decode("utf-8"))
There is another very similar thread on Stack Overflow, but my question is different in that my compressed file is originally JSON, and when I use this last method I get this error: Exception has occurred: json.decoder.JSONDecodeError Expecting value: line 1 column 1 (char 0)
Here is a simple script that answers the question: it walks the directory tree, checks whether each file (fname) is a gzip file (via its magic number rather than its name, because I'm cynical), and decompresses it.
import json
import gzip
import binascii
import os

def is_gz_file(filepath):
    # gzip files start with the two magic bytes 0x1f 0x8b
    with open(filepath, 'rb') as test_f:
        return binascii.hexlify(test_f.read(2)) == b'1f8b'

rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        filepath = os.path.join(dirName, fname)
        if is_gz_file(filepath):
            # the with block closes the file once its content has been read
            with gzip.open(filepath, 'rb') as f:
                json_content = json.loads(f.read().decode('utf-8'))
            print(json_content)
Tested and it works.
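The script above only prints the parsed content. If the goal is to end up with regular JSON files on disk, a minimal sketch along these lines might help. It copies the decompressed bytes straight to a new file without parsing them, so it does not depend on each file holding a single JSON document. The names `output_name` and `decompress_tree` are hypothetical helpers, and the naming rule (keep everything before the first ".gz", dropping the query-string tail) is my assumption based on the sample file name in the question.

```python
import gzip
import os
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gz_file(filepath):
    """Return True if the file starts with the gzip magic number."""
    with open(filepath, "rb") as test_f:
        return test_f.read(2) == GZIP_MAGIC


def output_name(fname):
    """Hypothetical naming rule: keep the part before the first '.gz'."""
    if ".gz" not in fname:
        return fname + ".json"
    base = fname.split(".gz", 1)[0]
    return base if base.endswith(".json") else base + ".json"


def decompress_tree(root_dir):
    """Decompress every gzip file under root_dir next to its source.

    Returns the list of files written. The compressed originals are kept.
    """
    written = []
    for dir_name, _subdirs, file_list in os.walk(root_dir):
        for fname in file_list:
            src = os.path.join(dir_name, fname)
            if not is_gz_file(src):
                continue
            dst = os.path.join(dir_name, output_name(fname))
            # Stream in chunks so large files are never held fully in memory.
            with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
            written.append(dst)
    return written


# Example call (path taken from the question, adjust as needed):
# decompress_tree('E:/echoverse/Subdivided Tweets/Subdivided Tweets')
```

Note that two compressed files in the same folder that map to the same output name would overwrite each other under this naming rule, so it is worth checking a single sub-folder before running it over all of them.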