python pandas tarfile

Can pandas read an archive within an archive?


I have an archive file (archive.tar.gz) which contains multiple gzip-compressed files (file.txt.gz).

If I first extract the .txt.gz files to a folder, I can then open them with pandas directly using:

import pandas as pd

df = pd.read_csv('file.txt.gz', sep='\t', encoding='utf-8')

But if I read the same file straight from the archive using the tarfile library, it doesn't work:

import pandas as pd
import tarfile

tar = tarfile.open("archive.tar.gz", "r:*")
csv_path = tar.getnames()[1]
df = pd.read_csv(tar.extractfile(csv_path), sep='\t', encoding='utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Is it possible to read the inner file without extracting it first?


Solution

  • When you pass a filename, pandas infers that the file is gzip-compressed from the .gz extension.

    When you pass a file object, there is no filename to inspect, so pandas treats the stream as plain text. You need to tell it about the compression explicitly so that it can decompress the data as it reads.

    This should work:

    df = pd.read_csv(
        tar.extractfile(csv_path),
        compression='gzip',
        sep='\t',
        encoding='utf-8')
    

    For more details, see the entry about the "compression" argument in the documentation for read_csv().
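
To see why the error complains about byte 0x8b at position 1: every gzip stream begins with the two bytes 0x1f 0x8b, and 0x8b is not a valid first byte of a UTF-8 sequence. The sketch below uses only the standard library to build a small nested archive in memory, confirm those leading bytes on the extracted member, and then decompress it on the fly. The sample data and the helper names (build_nested_archive, read_member_rows) are made up for illustration and are not part of the original question.

```python
import csv
import gzip
import io
import tarfile


def build_nested_archive() -> bytes:
    """Build a tar.gz in memory that holds one gzip-compressed TSV member."""
    # Illustrative data only; stands in for the real file.txt.gz.
    inner = gzip.compress("a\tb\n1\t2\n3\t4\n".encode("utf-8"))
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="file.txt.gz")
        info.size = len(inner)
        tar.addfile(info, io.BytesIO(inner))
    return buf.getvalue()


def read_member_rows(archive_bytes: bytes, member_name: str):
    """Return the member's first two bytes and its decoded tab-separated rows."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        member = tar.extractfile(member_name)
        # The member is still gzip data at this point: it starts with 1f 8b.
        magic = member.read(2)
        member.seek(0)
        # Wrap the binary stream so it is decompressed and decoded while reading.
        with gzip.open(member, mode="rt", encoding="utf-8", newline="") as text:
            rows = list(csv.reader(text, delimiter="\t"))
    return magic, rows


magic, rows = read_member_rows(build_nested_archive(), "file.txt.gz")
print(magic.hex(), rows)
```

The text stream returned by gzip.open() is an ordinary file-like object, so it could also be handed to pd.read_csv() without the compression argument, as an alternative to compression='gzip'.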