Tags: python, python-3.x, generator-expression

Unexpected result with filter or generators


Here's a funny one. I was writing an answer to another question when I got an unexpected result using filter or a generator expression. I have a list of file paths:

paths = ['/directoryb/baba.txt', '/directorya/nigel.txt', '/directoryb/ralph.txt', '/directorya/jim.txt']

I build the set of distinct directories in the paths list:

from os.path import dirname
dirs = {dirname(path) for path in paths}

Now I want a list of generators (or even a generator of generators), each one yielding the elements of paths that are in the same directory. So I write:

dirs_iter = [(path for path in paths if path.startswith(dir)) for dir in dirs]

Wasn't I surprised after running:

for dir_iter in dirs_iter:
    for path in dir_iter:
        print(path)

And obtaining the following:

/directorya/nigel.txt
/directorya/jim.txt
/directorya/nigel.txt
/directorya/jim.txt

This is clearly wrong. And yet, if I use the following statement instead:

# now I'm generating the lists instead of using generators
dirs_iter = [[path for path in paths if path.startswith(dir)] for dir in dirs]

The printing loop shows the expected answer:

/directoryb/baba.txt
/directoryb/ralph.txt
/directorya/nigel.txt
/directorya/jim.txt

If I use filter and/or map instead of generators:

dirs_iter = map(lambda dir: filter(lambda path: path.startswith(dir), paths), dirs)

I get the wrong answer too.

EDIT: The map/filter version actually works.

What's going on here?


Solution

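  • The name dir is a closure variable: it is looked up when the generator runs, not when the generator expression is defined. By the time the generators are iterated, the list comprehension has finished and dir is bound to the last value taken from dirs: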

    >>> from os.path import dirname
    >>> paths = ['/directoryb/baba.txt', '/directorya/nigel.txt', '/directoryb/ralph.txt', '/directorya/jim.txt']
    >>> dirs = {dirname(path) for path in paths}
    >>> def echo(value):
    ...     print('echoing:', value)
    ...     return value
    ... 
    >>> dirs_iter = [(path for path in paths if path.startswith(echo(dir))) for dir in dirs]
    >>> for dir_iter in dirs_iter:
    ...     print('Iterating over the next dir_iter generator')
    ...     for path in dir_iter:
    ...         print(path)
    ... 
    Iterating over the next dir_iter generator
    echoing: /directoryb
    /directoryb/baba.txt
    echoing: /directoryb
    echoing: /directoryb
    /directoryb/ralph.txt
    echoing: /directoryb
    Iterating over the next dir_iter generator
    echoing: /directoryb
    /directoryb/baba.txt
    echoing: /directoryb
    echoing: /directoryb
    /directoryb/ralph.txt
    echoing: /directoryb
    >>> list(dirs)
    ['/directorya', '/directoryb']
    

    Because Python 3 uses a random hash seed by default, in my run /directoryb came last rather than /directorya. You can see that dir is only accessed (and echoed) when we actually iterate over each dir_iter generator, and that by then it is bound to a single value, the last one. The list(dirs) line shows the order in which the dirs set yields its values.

    Note that filter() does not have this problem; your map() and filter() combo works just fine, because dir is a parameter of the outer lambda and so each call gets its own binding:

    >>> dirs_iter = map(lambda dir: filter(lambda path: path.startswith(dir), paths), dirs)
    >>> for dir_iter in dirs_iter:
    ...     for path in dir_iter:
    ...         print(path)
    ... 
    /directorya/nigel.txt
    /directorya/jim.txt
    /directoryb/baba.txt
    /directoryb/ralph.txt
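
If you do want to keep the generators, one possible fix is to give each generator its own binding by creating it inside a function call. The following is only a sketch: the helper name paths_in and its parameter directory are made up for illustration, the loop variable is renamed to d to avoid shadowing the built-in dir, and dirs is sorted purely to make the order reproducible.

```python
from os.path import dirname

paths = ['/directoryb/baba.txt', '/directorya/nigel.txt',
         '/directoryb/ralph.txt', '/directorya/jim.txt']
# Sorted only so the order is the same on every run (sets are unordered).
dirs = sorted({dirname(path) for path in paths})

# Late binding: every generator looks up the same d, which is the last
# directory by the time any of them is iterated.
late = [(path for path in paths if path.startswith(d)) for d in dirs]
late_result = [list(gen) for gen in late]


def paths_in(directory):
    # Hypothetical helper: directory is a parameter, so each call has its
    # own binding and the returned generator closes over that binding.
    return (path for path in paths if path.startswith(directory))


dirs_iter = [paths_in(d) for d in dirs]
result = [list(dir_iter) for dir_iter in dirs_iter]

print(late_result)
print(result)
```

With the helper, each generator filters on the directory it was created for, which is the same reason the map()/filter() version behaves correctly.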