Tags: python, tableofcontents, tarfile

Navigating a large tar.gz file in Python without extracting it first


I have seen this question but I need something else.

My file contains a very large number of text files (hundreds of thousands) organized by variable name, something like:

filename/maxvalue/IDXstation.txt     (with X that goes from 100000 to 200000)
filename/minvalue/IDXstation.txt  
filename/meanvalue/IDXstation.txt 

and so on. The problem is that I don't have a readme.txt file that tells me how many folders are in the tar file, how they are named (I made up the names above), or how many stations are in each folder. For now, all I want is to read the structure of filename.tar.gz and print something like:

filename/maxvalue/  
filename/minvalue/  
filename/meanvalue/

I need to read its structure before I start extracting, because I am interested in only some of the folders, not all of them.

If I use

for tarinfo in tar:
    print(tarinfo.name)

it prints the name of every file, and there are hundreds of thousands of them. I don't want that, but I am not sure how to restrict the output to the folders.


Solution

  • To print the top-level directories in the tar archive, e.g., up to the second level:

    #!/usr/bin/env python
    import sys
    import tarfile
    
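    # Iterating over the archive lists its members; nothing is extracted to disk.
    with tarfile.open(sys.argv[1]) as archive:
        for member in archive:
            # Keep directory entries at most two levels deep,
            # e.g. "filename" and "filename/maxvalue".
            if member.isdir() and member.name.count('/') < 2:
                print(member.name)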
    

    Usage:

    $ print-top-level-dirs <tar-archive>
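
The loop above only prints something if the archive stores explicit directory entries, which not every tar file does. As a hedged alternative, the sketch below derives the folders from the file paths instead and also counts the files in each one. The function name count_files_per_dir and its depth parameter are made up for illustration, and the sketch assumes member names use "/" separators with no leading "./".

```python
import tarfile
from collections import Counter


def count_files_per_dir(archive_path, depth=2):
    """Count regular files under each directory prefix of at most `depth` levels.

    Hypothetical helper for illustration; it is not part of the tarfile module.
    """
    counts = Counter()
    # Iterating over the archive lists its members; nothing is written to disk.
    with tarfile.open(archive_path) as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Directory components of the path, without the file name itself.
            parents = member.name.split("/")[:-1]
            if parents:
                # Truncate to `depth` levels, e.g. "filename/maxvalue/".
                counts["/".join(parents[:depth]) + "/"] += 1
    return counts
```

For the layout described in the question, calling count_files_per_dir("filename.tar.gz") should return one entry per folder such as "filename/maxvalue/", each mapped to the number of station files found in it. A gzip-compressed archive still has to be read from start to end, so a single pass over a very large file can take a while.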