I just downloaded a .gz file which contains a lot of folders and files, and among them is a .txt file containing German sentences.
import gzip
import requests

url = 'https://pcai056.informatik.uni-leipzig.de/downloads/corpora/{}'
filename = 'deu-be_web_2013_10K.tar.gz'
with gzip.open(filename, 'wb') as gz:
    download_url = url.format(filename)
    r = requests.get(download_url)
    gz.write(r.content)
The .txt file is all I need, and I wonder how I can extract only that one file, if that's possible. All I've managed to do is read in the entire archive and write it out as a .txt file, but the result is messy and contains a lot of unneeded text.
with gzip.open(path, 'rb') as gz, open('something.txt', 'wb') as f:
    content = gz.read()
    f.write(content)
That's not just a .gz file. It's a .tar.gz file, where tar is an archive format that combines multiple files into a single file, and gzip was used to compress that single file. gzip can only extract the single tar file, but then you need something to interpret the tar file format to extract one of the files contained within.
Use tarfile, not gzip. Opening with "r:gz" will do the decompression for you as well.
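As a rough sketch of what that could look like: the helper below opens the archive in "r:gz" mode, walks its members, and writes out only the regular files whose names end in .txt. The function name extract_txt_members and the dest_dir parameter are made up for illustration, and the sketch assumes you are happy to drop the archive's folder structure and keep just the file names.

```python
import os
import tarfile


def extract_txt_members(archive_path, dest_dir):
    """Extract only the regular .txt files from a .tar.gz archive.

    Hypothetical helper: returns the list of paths written under dest_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    # "r:gz" makes tarfile gunzip the stream while it parses the tar format.
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links, and anything that is not a .txt file.
            if not (member.isfile() and member.name.endswith(".txt")):
                continue
            # Drop the folder structure; keep only the file name.
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            written.append(target)
    return written
```

With the file from the question this would be called as extract_txt_members('deu-be_web_2013_10K.tar.gz', 'corpus'). One caveat: the downloaded bytes are most likely already gzip-compressed, so saving them through gzip.open(filename, 'wb') would compress them a second time and tarfile would no longer recognize the result. Writing r.content with a plain open(filename, 'wb') should avoid that.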