python glob and iglob when iterating over two directories


When I try to iterate over two directories, the first with a smaller number of files and the second with a larger number, I run into a problem: I wanted to use iglob for the directory with many files, but this doesn't work as expected.

import glob

large_n_files = glob.iglob(pathtodir)
small_n_files = glob.iglob(pathtootherdir)

for s in small_n_files:
    for l in large_n_files:
        print(l, s)

yields (assuming e.g. small_n = 2 and large_n = 3)

l1 s1
l2 s1
l3 s1

When I switch to glob for the large_n_files, I get the result that I want, i.e.

large_n_files = glob.glob(pathtodir)
small_n_files = glob.iglob(pathtootherdir)

for s in small_n_files:
    for l in large_n_files:
        print(l,s)

yields

l1 s1
l2 s1
l3 s1
l1 s2
l2 s2
l3 s2

Why is this so? (I guess I have to learn more about iterators...) If I want to use this for a really large number of files, wouldn't glob be less efficient? How can I work around this?


Solution

  • When you do:

    small_n_files = glob.iglob(pathtootherdir)
    

    you get back an iterator; this means you can iterate over it only once.
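
    For example, you can reproduce this single-use behaviour with any plain Python iterator (a minimal sketch, independent of glob):

    it = iter(['l1', 'l2', 'l3'])  # stands in for glob.iglob(...)
    for x in it:
        print(x)                   # prints l1, l2, l3
    for x in it:
        print(x)                   # prints nothing; the iterator is already exhausted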

    On the other hand, when you do:

    large_n_files = glob.glob(pathtodir)
    

    you create a list, which you can iterate over multiple times (each for loop creates a fresh iterator object over it), but you hold the full list in memory.
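
    You can see the difference by asking for the iterators explicitly; each for loop calls iter() on the list and gets an independent one (a small sketch):

    names = ['l1', 'l2', 'l3']   # like the result of glob.glob(...)
    it1 = iter(names)            # each loop gets its own iterator,
    it2 = iter(names)            # so the list can be traversed repeatedly
    print(next(it1), next(it2))  # l1 l1 -- both start from the beginning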

    If you don't want to hold large_n_files in memory (because it is too big), you can use the following code:

    import glob

    small_n_files = glob.iglob(pathtootherdir)

    for s in small_n_files:
        for l in glob.iglob(pathtodir):  # a fresh iterator is created on every pass
            print(l, s)

    This way you never have the full list of matches for pathtodir in memory.
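
    Alternatively, since the small set is cheap to keep in memory, you could materialise it with glob.glob and swap the loops, so the large directory is streamed exactly once (a sketch using the question's placeholder patterns; note the pairs then come out grouped by l rather than by s):

    import glob

    small_list = glob.glob(pathtootherdir)  # small, so a list is affordable

    for l in glob.iglob(pathtodir):         # large set iterated a single time
        for s in small_list:
            print(l, s)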