Is there a way to access the individual file names in the filter lambda when adding a directory using tarfile.add?
I'm using the tarfile module to create archives of project directories. I no longer need some of the files in them, and I'd like to leave those out:
myproj/  # example; actual project directory structure much deeper
    importantfile.txt
    semi-importantfile.doc
    useless-file.exe  # ignore this one
What I am doing right now is using tarfile.add's exclude parameter to skip useless-file.exe.
import os
import tarfile
with tarfile.open('mytar.tar', 'w') as mytar:
    # exclude is called with each path as a plain string
    mytar.add('myproj', exclude=lambda x: os.path.basename(x) == 'useless-file.exe')
I'm aware that exclude is now deprecated, and in the interest of future-proofing I'm trying to switch to the new filter parameter.
mytar.add('myproj', filter=lambda x: (
    x if x.name != 'useless-file.exe'
    else None))
However, doing this ends up adding useless-file.exe to the tarball. With some tests I discovered this is because, while exclude is fed the name of the directory and all its contents recursively, filter only gets the TarInfo for the file explicitly being added (in this case, the directory myproj).
So is there a way to replicate the behavior I had with exclude using filter? If possible, I would really rather not iterate through all of my directories recursively just to check that I don't add any unwanted files.
See @larsks's answer for a complete explanation of the problem. My mistake was that when using exclude I called os.path.basename on x (see edited code above), but I forgot to do the same on x.name when using filter.
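For anyone landing here, a minimal sketch of that fix might look like the following. It assumes the same myproj layout as above; skip_useless and build_demo_archive are hypothetical names used only for illustration, and the tree is recreated in a temporary directory so the snippet runs on its own.

```python
import os
import tarfile
import tempfile


def skip_useless(tarinfo):
    # tarinfo.name is the member's path inside the archive
    # (e.g. 'myproj/useless-file.exe'), so compare the basename only.
    if os.path.basename(tarinfo.name) == 'useless-file.exe':
        return None  # returning None leaves the member out
    return tarinfo


def build_demo_archive():
    # Hypothetical helper: builds a throwaway copy of the example tree
    # and returns the sorted member names of the resulting archive.
    with tempfile.TemporaryDirectory() as tmp:
        proj = os.path.join(tmp, 'myproj')
        os.mkdir(proj)
        for fname in ('importantfile.txt', 'semi-importantfile.doc',
                      'useless-file.exe'):
            with open(os.path.join(proj, fname), 'w') as fh:
                fh.write('demo')
        tar_path = os.path.join(tmp, 'mytar.tar')
        with tarfile.open(tar_path, 'w') as mytar:
            mytar.add(proj, arcname='myproj', filter=skip_useless)
        with tarfile.open(tar_path) as mytar:
            return sorted(mytar.getnames())


print(build_demo_archive())
```

The listing should contain myproj and the two files worth keeping, but not useless-file.exe. Comparing the basename rather than the full name is what makes this work at any depth in the tree.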
I don't think the filter method behaves the way you think it does. For example, if I have a directory structure that looks like:
example/
    file0.1
    file0.2
    dir1/
        file1.1
        file1.2
And I run the following code:
import tarfile

def myfilter(thing):
    # thing is the TarInfo for the member about to be added
    print('myfilter called for {thing.name}'.format(thing=thing))
    return thing

t = tarfile.open('archive.tar', mode='w')
t.add('example', recursive=True, filter=myfilter)
t.close()  # finish writing the archive
I see as output:
myfilter called for example
myfilter called for example/file0.1
myfilter called for example/file0.2
myfilter called for example/dir1
myfilter called for example/dir1/file1.1
myfilter called for example/dir1/file1.2
That is, the filter is getting called once per item added to the archive. If I wanted to exclude example/dir1/file1.1, I would write a filter function that looked something like this:
def exclude_file1(thing):
    # Falling through returns None implicitly, which excludes the member
    if thing.name != 'example/dir1/file1.1':
        return thing
When using this as the filter in the above example, the resulting archive contains:
$ tar tf archive.tar
example/
example/file0.1
example/file0.2
example/dir1/
example/dir1/file1.2
(edit: the above example was tested with Python 3.5)
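Because the filter is also called for directories, the same approach can drop a whole subtree. As far as I can tell, when the filter returns None for a directory, add() skips it before recursing, so nothing below it is added either. Here is a rough sketch under that assumption; exclude_dir1 and archive_members are hypothetical names, and the example tree is rebuilt in a temporary directory so the snippet is self-contained.

```python
import os
import tarfile
import tempfile


def exclude_dir1(thing):
    # Returning None for the directory member leaves out the directory
    # and, since add() does not descend into it, all of its contents.
    if thing.name == 'example/dir1':
        return None
    return thing


def archive_members():
    # Hypothetical helper: recreates the example tree, archives it with
    # the filter above, and returns the sorted member names.
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, 'example')
        os.makedirs(os.path.join(root, 'dir1'))
        for rel in ('file0.1', 'file0.2', 'dir1/file1.1', 'dir1/file1.2'):
            with open(os.path.join(root, *rel.split('/')), 'w') as fh:
                fh.write('demo')
        tar_path = os.path.join(tmp, 'archive.tar')
        with tarfile.open(tar_path, mode='w') as t:
            t.add(root, arcname='example', filter=exclude_dir1)
        with tarfile.open(tar_path) as t:
            return sorted(t.getnames())


print(archive_members())
```

With this filter the archive should hold only example, example/file0.1 and example/file0.2.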