Tags: python, python-2.7, subprocess, popen, 7zip

How to programmatically count the number of files in an archive using Python


In the program I maintain, it is done like this:

    # count the files in the archive
    length = 0
    command = ur'"%s" l -slt "%s"' % (u'path/to/7z.exe', srcFile)
    ins, err = Popen(command, stdout=PIPE, stdin=PIPE,
                     startupinfo=startupinfo).communicate()
    ins = StringIO.StringIO(ins)
    for line in ins: length += 1
    ins.close()

  1. Is this really the only way? I can't find any other command, but it seems odd that I can't just ask for the number of files.
  2. What about error checking? Would it be enough to modify this to:

    proc = Popen(command, stdout=PIPE, stdin=PIPE,
                 startupinfo=startupinfo)
    out = proc.stdout
    # ... count
    returncode = proc.wait()
    if returncode:
        raise Exception(u'Failed reading number of files from ' + srcFile)
    

    or should I actually parse the output of Popen?

EDIT: I'm interested in 7z, rar and zip archives (those supported by 7z.exe), but 7z and zip would be enough for starters.


Solution

  • To count the number of archive members in a zip archive in Python:

    #!/usr/bin/env python
    import sys
    from contextlib import closing
    from zipfile import ZipFile
    
    with closing(ZipFile(sys.argv[1])) as archive:
        count = len(archive.infolist())
    print(count)
    

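    Counting members only reads the archive's central directory, so nothing is decompressed here. The zlib, bz2 and lzma modules are used, if available, only when member data is actually read.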
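
Note that `infolist()` also returns directory entries when the archive stores them. As a minimal sketch, assuming directory entries are recognisable by a trailing slash in the member name (the zip convention), the two can be counted separately. The function name `count_zip_members` is made up for this illustration, and `ZipFile` is used directly as a context manager, which works on Python 2.7 and 3.2+:

```python
import zipfile


def count_zip_members(path):
    """Return (files, directories) for the zip archive at `path`."""
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Assumption: directory entries are stored with a trailing slash.
    directories = sum(1 for name in names if name.endswith("/"))
    return len(names) - directories, directories
```

If only the total matters, `len(archive.infolist())` as shown above is enough.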


    To count the number of regular files in a tar archive:

    #!/usr/bin/env python
    import sys
    import tarfile
    
    with tarfile.open(sys.argv[1]) as archive:
        count = sum(1 for member in archive if member.isreg())
    print(count)
    

    It may support gzip, bz2 and lzma compression, depending on the version of Python.

    You could find a 3rd-party module that would provide a similar functionality for 7z archives.
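
For the formats the standard library does handle, the two snippets above could be combined into one helper that picks the reader by inspecting the file. This is only a sketch: `count_archive_files` is a hypothetical name, zip directory entries are assumed to end with a slash, and 7z or rar archives simply fall through to the `ValueError`:

```python
import tarfile
import zipfile


def count_archive_files(path):
    """Count the non-directory members of a tar or zip archive."""
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            # Count regular files only, as in the tar example above.
            return sum(1 for member in archive if member.isreg())
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # Assumption: zip directory entries have a trailing slash.
            return sum(1 for name in archive.namelist()
                       if not name.endswith("/"))
    raise ValueError("unsupported archive format: %r" % (path,))
```

A caller could catch the `ValueError` and fall back to the external 7z utility for everything else.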


    To get the number of files in an archive using the 7z utility:

    import os
    import re
    import subprocess
    
    def count_files_7z(archive):
        s = subprocess.check_output(["7z", "l", archive], env=dict(os.environ, LC_ALL="C"))
        return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders$', s).group(1))
    

    Here's a version that may use less memory if there are many files in the archive:

    import os
    import re
    from subprocess import Popen, PIPE, CalledProcessError
    
    def count_files_7z(archive):
        command = ["7z", "l", archive]
        p = Popen(command, stdout=PIPE, bufsize=1, env=dict(os.environ, LC_ALL="C"))
        with p.stdout:
            for line in p.stdout:  # read line by line; only the current line is kept
                if line.startswith(b'Error:'): # found error
                    error = line + b"".join(p.stdout)
                    raise CalledProcessError(p.wait(), command, error)
        returncode = p.wait()
        assert returncode == 0
        # after the loop, `line` holds the last line of output: the summary
        return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders', line).group(1))
    

    Example:

    import sys
    
    try:
        print(count_files_7z(sys.argv[1]))
    except CalledProcessError as e:
        getattr(sys.stderr, 'buffer', sys.stderr).write(e.output)
        sys.exit(e.returncode)
    
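
On the question of whether to parse the output: the parsing step can be kept separate from running the process, which makes it easy to check on its own. The sketch below assumes the summary line has the form matched by the regex used above (`<N> files, <M> folders`); the exact wording may differ between 7z versions, and `parse_7z_summary` is a hypothetical helper name:

```python
import re

# Same pattern as in the functions above, with the folder count captured too.
_SUMMARY = re.compile(br'(\d+)\s+files,\s+(\d+)\s+folders')


def parse_7z_summary(output):
    """Return (files, folders) from captured `7z l` output (bytes)."""
    matches = _SUMMARY.findall(output)
    if not matches:
        raise ValueError("no '<N> files, <M> folders' summary line found")
    files, folders = matches[-1]  # use the last match, i.e. the summary line
    return int(files), int(folders)
```

Raising an explicit exception when the summary is missing avoids the `AttributeError` that `re.search(...).group(1)` would produce on unexpected output.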

    To count the number of lines in the output of a generic subprocess:

    from functools import partial
    from subprocess import Popen, PIPE, CalledProcessError
    
    p = Popen(command, stdout=PIPE, bufsize=-1)
    with p.stdout:
        read_chunk = partial(p.stdout.read, 1 << 15)
        count = sum(chunk.count(b'\n') for chunk in iter(read_chunk, b''))
    if p.wait() != 0:
        raise CalledProcessError(p.returncode, command)
    print(count)
    

    It supports unlimited output: it reads fixed-size chunks, so memory use stays bounded no matter how much the subprocess prints.
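
The same idea can be wrapped in two small functions so that the counting part can be tried on any binary stream. This is a sketch: the names `count_lines` and `count_output_lines` are invented for this example, and the chunk size is the same `1 << 15` as above:

```python
import subprocess
from functools import partial


def count_lines(stream, chunk_size=1 << 15):
    """Count b'\\n' bytes in a binary stream, one chunk at a time."""
    read_chunk = partial(stream.read, chunk_size)
    # iter(callable, sentinel) calls read_chunk() until it returns b'' (EOF).
    return sum(chunk.count(b"\n") for chunk in iter(read_chunk, b""))


def count_output_lines(command):
    """Run `command` and return the number of lines it writes to stdout."""
    p = subprocess.Popen(command, stdout=subprocess.PIPE, bufsize=-1)
    with p.stdout:
        count = count_lines(p.stdout)
    if p.wait() != 0:
        raise subprocess.CalledProcessError(p.returncode, command)
    return count
```

For example, `count_lines(io.BytesIO(b"a\nb\n"))` counts two lines without starting a process.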


    Could you explain why bufsize=-1 (as opposed to bufsize=1 in your previous answer: stackoverflow.com/a/30984882/281545)?

    bufsize=-1 means use the default I/O buffer size, as opposed to bufsize=0 (unbuffered), which is the default on Python 2. It is a performance boost on Python 2, and it is the default on recent Python 3 versions. On some earlier Python 3 versions, where the default had not yet been changed to bufsize=-1, you might get a short read (lose data).

    This answer reads in chunks and therefore the stream is fully buffered for efficiency. The solution you've linked is line-oriented. bufsize=1 means "line buffered". There is minimal difference from bufsize=-1 otherwise.

    And what does read_chunk = partial(p.stdout.read, 1 << 15) buy us?

    It is equivalent to read_chunk = lambda: p.stdout.read(1<<15) but provides more introspection in general. It is used to implement wc -l in Python efficiently.
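
To make the introspection point concrete, here is a small sketch that uses an in-memory `io.BytesIO` stream in place of `p.stdout` (the stream contents and the chunk size of 4 are arbitrary choices for the demonstration):

```python
import io
from functools import partial

stream = io.BytesIO(b"abcdefghij")
read_chunk = partial(stream.read, 4)

# Unlike a lambda, a partial object records what it wraps.
wrapped_name = read_chunk.func.__name__  # name of the wrapped method
bound_args = read_chunk.args             # positional arguments fixed in advance

# iter(callable, sentinel) keeps calling read_chunk() until it returns b''.
chunks = list(iter(read_chunk, b""))
print(wrapped_name, bound_args, chunks)
```

A lambda would read the stream the same way, but it would show up only as `<lambda>` when inspected or debugged.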