When I try to iterate over two directories, the first containing a smaller number of files and the second a larger number, I run into a problem: I wanted to use iglob for the directory with many files, but this doesn't work.
import glob

large_n_files = glob.iglob(pathtodir)
small_n_files = glob.iglob(pathtootherdir)
for s in small_n_files:
    for l in large_n_files:
        print(l, s)
yields (assuming e.g. small_n = 2, large_n = 3)
l1 s1
l2 s1
l3 s1
When I switch to glob for the large_n_files, I get the result that I want, i.e.
large_n_files = glob.glob(pathtodir)
small_n_files = glob.iglob(pathtootherdir)
for s in small_n_files:
    for l in large_n_files:
        print(l, s)
yields
l1 s1
l2 s1
l3 s1
l1 s2
l2 s2
l3 s2
Why is this so? (I guess I have to learn more about iterators...) If I want to use this for a really large number of files, wouldn't glob be less efficient? How can I work around this?
When you do:
small_n_files = glob.iglob(pathtootherdir)
you get back an iterator; this means you can iterate over it only once. The first pass of the inner loop exhausts large_n_files, so on every later iteration over small_n_files the inner loop has nothing left to yield.
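To see the exhaustion in isolation, here is a minimal sketch that replaces the glob calls with iter() over hypothetical lists of names:

# Stand-ins for the glob results; iter() gives a one-shot iterator
large = iter(['l1', 'l2', 'l3'])
small = ['s1', 's2']

for s in small:
    for l in large:   # the first outer pass consumes the iterator completely
        print(l, s)
# prints l1 s1, l2 s1, l3 s1 -- then nothing, because `large` is exhausted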
On the other hand, when you do:
large_n_files = glob.glob(pathtodir)
then you create a list, which you can iterate over multiple times (each inner for loop builds a fresh iterator over the list). But you hold the full list in memory.
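The same sketch with a plain list instead of a one-shot iterator prints every pair:

# A list can be iterated any number of times
large = ['l1', 'l2', 'l3']
small = ['s1', 's2']

for s in small:
    for l in large:   # fresh iterator over the list on every outer pass
        print(l, s)
# prints all six pairs: l1 s1 ... l3 s2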
If you don't want to hold large_n_files in memory (because it is too big), you can use the following code:
small_n_files = glob.iglob(pathtootherdir)
for s in small_n_files:
    for l in glob.iglob(pathtodir):
        print(l, s)
This way you never have the full list of pathtodir in memory, at the cost of re-scanning pathtodir once per file in small_n_files.
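If the order in which the pairs appear doesn't matter, here is an alternative sketch that avoids the repeated scans: keep the small result set in memory with glob.glob and stream the large directory exactly once with iglob in the outer loop.

import glob

# Assumes pathtootherdir holds few files and pathtodir holds many
small_n_files = glob.glob(pathtootherdir)   # small list, cheap to keep in memory
for l in glob.iglob(pathtodir):             # large directory streamed exactly once
    for s in small_n_files:
        print(l, s)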