
Keras/TensorFlow: Train multiple models on the same GPU in a loop or using Process


I have multiple models to train in Keras/TensorFlow, one after the other, without manually launching train.py for each, so I did:

for i in range(0, max_count):
    model = get_model(i)   # returns ith model
    model.fit(...)
    model.save(...)

It runs fine for i=0 (and in fact runs perfectly when each model is trained separately). The problem is that when the second model is loaded, I get a ResourceExhaustedError (OOM), so I tried releasing memory at the end of the for loop:

import gc
import keras
import tensorflow as tf

del model
keras.backend.clear_session()  # drops everything Keras/TF holds in the session
tf.reset_default_graph()       # TF 1.x: discard the default graph
gc.collect()                   # force a garbage-collection pass

none of which works, individually or in combination.

Looking into it further, I found that the only reliable way to release GPU memory is to end the process that allocated it.

Also, from this Keras issue:

Update (2018/08/01): Currently only TensorFlow backend supports proper cleaning up of the session. This can be done by calling K.clear_session(). This will remove EVERYTHING from memory (models, optimizer objects and anything that has tensors internally). So there is no way to remove a specific stale model. This is not a bug of Keras but a limitation of the backends.

So clearly the way to go is to create a new process every time I load a model, wait for it to finish, and then start the next model in a fresh process, like this:

import multiprocessing

def train_model_in_new_process(model_module, kfold_object, x, y, x_test, y_test, epochs, model_file_count):
    # train in a separate process so all GPU memory is released when it exits
    training_process = multiprocessing.Process(target=train_model, args=(x, y, x_test, y_test, epochs, model_file_count))
    training_process.start()
    training_process.join()

but then it throws this error:

  File "train.py", line 110, in train_model_in_new_process
    training_process.start()
  File "F:\Python\envs\tensorflow\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects
Using TensorFlow backend.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I can't extract anything actionable from this error. It clearly points at the line training_process.start(), but I can't see what is failing to pickle, or why.
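The TypeError at least says what cannot be pickled: a module object. That much is easy to reproduce without Keras at all (a minimal check, with math standing in for any module):

import pickle
import math  # stands in for any module object

pickle.dumps(math)  # TypeError: can't pickle module objects

But I don't immediately see which of my arguments is (or references) a module.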

Any help with training these models, either in a for loop or using Process, is appreciated.


Solution

  • Apparently, multiprocessing can't pickle module objects, or more precisely: everything handed to a Process must be picklable, and a module, including one returned by importlib, is not. I was loading models from numbered .py files using importlib:

    model_module = importlib.import_module(model_file)
    

    and hence the trouble.

    Doing the same import inside the Process instead made it all work fine (a sketch of the arrangement is after this answer) :)

    But I still could NOT find a way to do this with a plain for loop, without processes. If you have an answer, please post it here; you're welcome to. Anyway, I'm sticking with processes, because they are cleaner in the sense that each one is isolated, and all the memory allocated for a given training run is reclaimed by the OS when its process exits.
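For reference, a minimal sketch of the working setup. The worker function, the "model_%d" file naming, and the surrounding loop are illustrative stand-ins for my code rather than a drop-in recipe; the if __name__ == '__main__' guard is required on Windows, where child processes are spawned and the main module gets re-imported.

import importlib
import multiprocessing

def worker(model_file, x, y, x_test, y_test, epochs, model_file_count):
    # runs in the child process: the module object is created here,
    # so it never has to be pickled and sent across the process boundary
    model_module = importlib.import_module(model_file)
    model_module.train_model(x, y, x_test, y_test, epochs, model_file_count)

def train_model_in_new_process(model_file, x, y, x_test, y_test, epochs, model_file_count):
    # pass only picklable things (strings, arrays, ints) to the child
    p = multiprocessing.Process(target=worker, args=(model_file, x, y, x_test, y_test, epochs, model_file_count))
    p.start()
    p.join()  # wait, so only one model occupies the GPU at a time

if __name__ == '__main__':  # required with the Windows spawn start method
    # x, y, x_test, y_test, epochs come from your own data loading
    for i in range(0, max_count):
        train_model_in_new_process("model_%d" % i, x, y, x_test, y_test, epochs, i)

Since each child gets a fresh CUDA context, all GPU memory is returned to the driver by the time p.join() returns, so the next model starts with a clean slate.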