Filtering tarfile.add using individual file names


Is there a way to access the individual file names in the filter lambda when adding a directory using tarfile.add?

I'm using the tarfile module to create archives of project directories. These directories contain some files I no longer need and would like to leave out of the archive:

myproj/  # example; actual project directory structure much deeper
    importantfile.txt
    semi-importantfile.doc
    useless-file.exe  # ignore this one

What I am doing right now is using tarfile.add's exclude parameter to skip useless-file.exe.

import os
import tarfile

with tarfile.open('mytar.tar', 'w') as mytar:
    # exclude receives each path as a string; skip entries whose base name matches
    mytar.add('myproj', exclude=lambda x: os.path.basename(x) == 'useless-file.exe')

I'm aware that exclude is now deprecated, and in the interest of future-proofing I'm trying to switch to using the new filter parameter.

    mytar.add('myproj', filter=lambda x: (
                                x if x.name != 'useless-file.exe'
                                else None))

However, doing this ends up adding useless-file.exe to the tarball. From some tests, it seemed that this is because, while exclude is fed the name of the directory and all of its contents recursively, filter only gets the TarInfo for the file explicitly being added (in this case, the directory myproj).

So is there a way to replicate the behavior I had with exclude using filter? If it's possible, I would really rather not iterate through all of my directories recursively just to check that I don't add any unwanted files.

Explanation of Solution

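See @larsks's answer for a complete explanation of the problem. My issue was that when using exclude I called os.path.basename on x (see edited code above), but I forgot to call it on x.name when using filter. Since x.name is the full path inside the archive (for example, myproj/useless-file.exe), comparing it directly to 'useless-file.exe' never matched.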
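
For reference, here is a minimal sketch of the corrected filter, with os.path.basename applied to x.name. The helper names skip_useless and build_demo_archive, the sub directory, and the temporary directory are illustrative assumptions and not part of my original code.

```python
import os
import tarfile
import tempfile


def skip_useless(tarinfo):
    """Return None (skip the entry) when its base name is useless-file.exe."""
    # tarinfo.name is the full path inside the archive, e.g.
    # 'myproj/sub/useless-file.exe', so compare only the last component.
    if os.path.basename(tarinfo.name) == 'useless-file.exe':
        return None
    return tarinfo


def build_demo_archive(workdir):
    """Create a small myproj tree in workdir, archive it, return member names."""
    proj = os.path.join(workdir, 'myproj')
    os.makedirs(os.path.join(proj, 'sub'))
    for rel in ('importantfile.txt', 'useless-file.exe',
                os.path.join('sub', 'useless-file.exe')):
        with open(os.path.join(proj, rel), 'w') as fh:
            fh.write('demo')
    tar_path = os.path.join(workdir, 'mytar.tar')
    with tarfile.open(tar_path, 'w') as mytar:
        # arcname keeps the paths in the archive relative ('myproj/...').
        mytar.add(proj, arcname='myproj', filter=skip_useless)
    with tarfile.open(tar_path) as mytar:
        return sorted(mytar.getnames())


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        print(build_demo_archive(tmp))
```

Because the comparison uses only the base name, the file is skipped at any depth of the tree, which matches what the exclude version did.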


Solution

  • I don't think the filter parameter behaves the way you think it does. For example, if I have a directory structure that looks like:

    example/
      file0.1
      file0.2
      dir1/
        file1.1
        file1.2
    

    And I run the following code:

    import tarfile
    
    def myfilter(thing):
        print('myfilter called for {thing.name}'.format(thing=thing))
        return thing
    
    # The with block closes the archive, so it is fully written to disk.
    with tarfile.open('archive.tar', mode='w') as t:
        t.add('example', recursive=True, filter=myfilter)
    

    I see as output:

    myfilter called for example
    myfilter called for example/file0.1
    myfilter called for example/file0.2
    myfilter called for example/dir1
    myfilter called for example/dir1/file1.1
    myfilter called for example/dir1/file1.2
    

    That is, the filter is called once per item added to the archive, and each TarInfo's name is the item's full path within the archive. If I wanted to exclude example/dir1/file1.1, I would write a filter function that looked something like this:

    def exclude_file1(thing):
        # Returning None tells tarfile.add to leave this entry out.
        if thing.name != 'example/dir1/file1.1':
            return thing
        return None
    

    When using this as the filter in the above example, the resulting archive contains:

    $ tar tf archive.tar 
    example/
    example/file0.1
    example/file0.2
    example/dir1/
    example/dir1/file1.2
    

    (edit: the above example was tested with Python 3.5)
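
The `tar tf` listing above can also be checked from Python with TarFile.getnames(). The following is a rough sketch under a few assumptions: the helper name build_and_list and the temporary directory are hypothetical, and the names are sorted because the order in which entries are added can differ between Python versions.

```python
import os
import tarfile
import tempfile


def exclude_file1(thing):
    # Same idea as the filter in the answer: drop one entry by its full archive path.
    if thing.name != 'example/dir1/file1.1':
        return thing
    return None


def build_and_list(workdir):
    """Recreate the example tree in workdir, archive it, return sorted names."""
    root = os.path.join(workdir, 'example')
    os.makedirs(os.path.join(root, 'dir1'))
    for rel in ('file0.1', 'file0.2',
                os.path.join('dir1', 'file1.1'),
                os.path.join('dir1', 'file1.2')):
        with open(os.path.join(root, rel), 'w') as fh:
            fh.write('demo')
    tar_path = os.path.join(workdir, 'archive.tar')
    with tarfile.open(tar_path, mode='w') as t:
        # recursive=True is the default, so the filter sees every entry.
        t.add(root, arcname='example', filter=exclude_file1)
    with tarfile.open(tar_path) as t:
        return sorted(t.getnames())


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        for name in build_and_list(tmp):
            print(name)
```

Note that getnames() reports directories without the trailing slash that `tar tf` prints.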