Here's a funny one. I was writing an answer to another question when I got an unexpected result using filter
or a generator. I have a list of file paths:
paths = ['/directoryb/baba.txt', '/directorya/nigel.txt', '/directoryb/ralph.txt', '/directorya/jim.txt']
I build the set of distinct directories in the paths list:
from os.path import dirname
dirs = {dirname(path) for path in paths}
Now I want a list of generators (or even a generator of generators), each one yielding the elements of paths
that are in the same directory. So I do:
dirs_iter = [(path for path in paths if path.startswith(dir)) for dir in dirs]
Wasn't I surprised after running:
for dir_iter in dirs_iter:
    for path in dir_iter:
        print(path)
And obtaining the following:
/directorya/nigel.txt
/directorya/jim.txt
/directorya/nigel.txt
/directorya/jim.txt
This is clearly wrong. And yet, if I use the following statement:
# now I'm generating the lists instead of using generators
dirs_iter = [[path for path in paths if path.startswith(dir)] for dir in dirs]
The printing loop shows the expected answer:
/directoryb/baba.txt
/directoryb/ralph.txt
/directorya/nigel.txt
/directorya/jim.txt
If I use filter and/or map instead of generators:
dirs_iter = map(lambda dir: filter(lambda path: path.startswith(dir), paths), dirs)
I get the wrong answer too. EDIT: The map/filter version actually works.
What's going on here?
The name dir is a closure variable: it is looked up when the generator is executed, not when it is defined. By that time, dir is bound to the last value yielded by dirs:
>>> from os.path import dirname
>>> paths = ['/directoryb/baba.txt', '/directorya/nigel.txt', '/directoryb/ralph.txt', '/directorya/jim.txt']
>>> dirs = {dirname(path) for path in paths}
>>> def echo(value):
...     print('echoing:', value)
...     return value
...
>>> dirs_iter = [(path for path in paths if path.startswith(echo(dir))) for dir in dirs]
>>> for dir_iter in dirs_iter:
...     print('Iterating over the next dir_iter generator')
...     for path in dir_iter:
...         print(path)
...
Iterating over the next dir_iter generator
echoing: /directoryb
/directoryb/baba.txt
echoing: /directoryb
echoing: /directoryb
/directoryb/ralph.txt
echoing: /directoryb
Iterating over the next dir_iter generator
echoing: /directoryb
/directoryb/baba.txt
echoing: /directoryb
echoing: /directoryb
/directoryb/ralph.txt
echoing: /directoryb
>>> list(dirs)
['/directorya', '/directoryb']
Because Python 3 uses a random hash seed, /directoryb came last in my run rather than /directorya. But you can see that the dir value is accessed (and echoed) only when we actually iterate over the dir_iter generator, and that by then it is bound to a single value. The list(dirs) line shows the order in which the dirs set yields its values.
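The same late lookup can be reproduced without any file paths. The following is only a minimal sketch; the names gens, skip and seen are made up for this illustration, and a list is used instead of a set so the iteration order is fixed:

```python
# Hypothetical names (gens, skip, seen), chosen for this illustration only.
# A list gives a fixed iteration order, unlike the set in the question.
gens = [(n for n in range(3) if n != skip) for skip in [0, 1, 2]]

# No filter condition has run yet. `skip` is only read once we iterate,
# and by then the outer loop has finished with skip == 2.
seen = [list(g) for g in gens]
print(seen)  # [[0, 1], [0, 1], [0, 1]]
```

All three generators drop 2, because they share the one skip name and read it only when they are consumed.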
Note that your map() and filter() combo does not have this problem, because each call to the outer lambda binds its own dir argument; it works just fine:
>>> dirs_iter = map(lambda dir: filter(lambda path: path.startswith(dir), paths), dirs)
>>> for dir_iter in dirs_iter:
...     for path in dir_iter:
...         print(path)
...
/directorya/nigel.txt
/directorya/jim.txt
/directoryb/baba.txt
/directoryb/ralph.txt
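If you would rather keep generator expressions, one possible approach (a sketch, not the only fix) is to create each generator inside a function call, so the directory is a local name of that call, just as it is for the lambda above. The helper paths_in and the names directory and grouped are hypothetical, and sorted() is only there to make the order repeatable:

```python
from os.path import dirname

paths = ['/directoryb/baba.txt', '/directorya/nigel.txt',
         '/directoryb/ralph.txt', '/directorya/jim.txt']
# sorted() is an addition for this example, so the output order is stable.
dirs = sorted({dirname(path) for path in paths})


def paths_in(directory, all_paths):
    """Hypothetical helper: `directory` is local to each call,
    so every generator keeps its own value."""
    return (path for path in all_paths if path.startswith(directory))


dirs_iter = [paths_in(directory, paths) for directory in dirs]

# Consume the generators only now, after the list has been built.
grouped = [list(dir_iter) for dir_iter in dirs_iter]
print(grouped)
```

Each generator now yields only the paths under its own directory, even though it is consumed after the outer loop has finished.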