Tags: python, machine-learning, keras, generator

Joining two DirectoryIterators in Keras


Suppose I have something like the following:

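# or tensorflow.keras.preprocessing.image, depending on your installation
from keras.preprocessing.image import ImageDataGenerator

image_data_generator = ImageDataGenerator(rescale=1./255)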

train_generator = image_data_generator.flow_from_directory(
  'my_directory',
  target_size=(28, 28),
  batch_size=32,
  class_mode='categorical'
)

Then train_generator yields data from my_directory, which contains two subfolders, one for each of the classes 0 and 1.

Suppose also I have another directory that_directory, also with data split into classes 0 and 1. I want to augment my train_generator with this additional data.

Running train_generator = image_data_generator.flow_from_directory('that_directory', ...) replaces the generator, so the data from my_directory is no longer included.

Is there a way to augment or append both sets of data into one generator or an object that operates like a DirectoryIterator without changing the folder structure itself?


Solution

  • Just combine the generators in another generator, optionally with different augmentation configs:

    idg1 = ImageDataGenerator(**idg1_configs)
    idg2 = ImageDataGenerator(**idg2_configs)
    
    g1 = idg1.flow_from_directory('idg1_dir',...)
    g2 = idg2.flow_from_directory('idg2_dir',...)
    
    def combine_gen(*gens):
        while True:
            for g in gens:
                yield next(g)
    
    # ...
    model.fit_generator(combine_gen(g1, g2), steps_per_epoch=len(g1)+len(g2), ...)
    

    This yields batches alternately from g1 and g2.

    One might suggest using itertools.chain; however, you can't use it here, because the iterators returned by ImageDataGenerator never end. They keep generating batches indefinitely, so chain would never move past the first one. This endless behavior is what the fit_generator method expects of the generator you pass to it. From the Keras docs:

    ...The generator is expected to loop over its data indefinitely. An epoch finishes when steps_per_epoch batches have been seen by the model.
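The difference between chaining and alternating can be seen without Keras. The following is a minimal sketch in which endless_batches is a hypothetical stand-in for a Keras iterator (it is not part of any library) and the batches are just labels:

```python
import itertools


def endless_batches(name, n_batches):
    """Hypothetical stand-in for a Keras iterator: cycles over n_batches labels forever."""
    i = 0
    while True:
        yield f"{name}-batch{i % n_batches}"
        i += 1


def combine_gen(*gens):
    # Round-robin: take one batch from each generator in turn, forever.
    while True:
        for g in gens:
            yield next(g)


# itertools.chain only moves on once the first iterable is exhausted,
# which never happens for an endless generator.
chained = itertools.chain(endless_batches("g1", 2), endless_batches("g2", 2))
chained_first6 = list(itertools.islice(chained, 6))

combined = combine_gen(endless_batches("g1", 2), endless_batches("g2", 2))
combined_first6 = list(itertools.islice(combined, 6))

print(chained_first6)
print(combined_first6)
```

In this sketch every batch in chained_first6 comes from g1, while combined_first6 alternates between g1 and g2.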

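    If steps_per_epoch is not set, it defaults to len(generator), where generator is the object you pass to fit_generator. This only works when that object has a length. The iterators returned by ImageDataGenerator do (they are keras.utils.Sequence instances), so for them you don't need to set steps_per_epoch manually. A plain Python generator such as combine_gen has no length, so there the argument must be given. If you would like the combined generator to report its own length instead of summing the lengths by hand, you can use this solution:

    class CombinedGen:
        def __init__(self, *gens):
            self.gens = gens
    
        def generate(self):
            # Round-robin over the wrapped generators, forever
            while True:
                for g in self.gens:
                    yield next(g)
    
        def __len__(self):
            # Total number of batches in one pass over all generators
            return sum(len(g) for g in self.gens)
    
    # usage:
    cg = CombinedGen(g1, g2)
    # cg.generate() is a plain generator without a length, so pass len(cg) explicitly
    model.fit_generator(cg.generate(), steps_per_epoch=len(cg), ...)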
    

    You can also add __next__ and/or __iter__ methods to the CombinedGen class if you want to iterate directly over its instances (instead of iterating over cg.generate()).
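One possible way to add those methods is sketched below. FakeIterator is a hypothetical stand-in for a DirectoryIterator (it only has a length and an endless __next__), and the _pos attribute is an illustrative name, not something from the original answer:

```python
class FakeIterator:
    """Hypothetical stand-in for a DirectoryIterator: has a length and never stops."""

    def __init__(self, name, n_batches):
        self.name = name
        self.n_batches = n_batches
        self._i = 0

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        return self

    def __next__(self):
        batch = f"{self.name}-batch{self._i % self.n_batches}"
        self._i += 1
        return batch


class CombinedGen:
    def __init__(self, *gens):
        self.gens = gens
        self._pos = 0  # index of the generator to draw from next (illustrative name)

    def __len__(self):
        return sum(len(g) for g in self.gens)

    def __iter__(self):
        return self

    def __next__(self):
        # Take one batch from the current generator, then advance round-robin.
        g = self.gens[self._pos]
        self._pos = (self._pos + 1) % len(self.gens)
        return next(g)


cg = CombinedGen(FakeIterator("g1", 2), FakeIterator("g2", 3))
one_epoch = [next(cg) for _ in range(len(cg))]
print(one_epoch)
```

With this sketch, next(cg) and for batch in cg both work directly. Keras would presumably still treat such an object as a plain Python generator rather than a Sequence, so passing steps_per_epoch=len(cg) explicitly remains the safe choice. Also note that with generators of unequal length, strict alternation means one epoch of len(cg) steps does not draw each generator's batches exactly once.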