
Keras/TensorFlow: Train multiple models on the same GPU in a loop or using Process


I have multiple models to train in Keras/TensorFlow, one after the other, without manually launching train.py for each, so I did:

for i in range(0, max_count):
    model = get_model(i)   # returns ith model
    model.fit(...)
    model.save(...)

It runs fine for i=0 (and in fact runs perfectly when each model is trained separately). The problem is that when the second model is loaded, I get a ResourceExhaustedError (OOM), so I tried releasing memory at the end of the for loop:

import gc
import keras
import tensorflow as tf

del model
keras.backend.clear_session()  # drops everything Keras/TF holds in the session
tf.reset_default_graph()       # TF 1.x: discard the default graph
gc.collect()                   # force a garbage-collection pass

none of which works, individually or in combination.

Looking into it further, I found that the only reliable way to release GPU memory is to end the process that allocated it.

Also, from this Keras issue:

Update (2018/08/01): Currently only TensorFlow backend supports proper cleaning up of the session. This can be done by calling K.clear_session(). This will remove EVERYTHING from memory (models, optimizer objects and anything that has tensors internally). So there is no way to remove a specific stale model. This is not a bug of Keras but a limitation of the backends.

So clearly the way to go is to create a new process every time I load a model, wait for it to finish, and then start the next model in a fresh process, like this:

import multiprocessing

def train_model_in_new_process(model_module, kfold_object, x, y, x_test, y_test, epochs, model_file_count):
    # train in a separate process so all GPU memory is released when it exits
    training_process = multiprocessing.Process(target=train_model, args=(x, y, x_test, y_test, epochs, model_file_count))
    training_process.start()
    training_process.join()

but then it throws this error:

  File "train.py", line 110, in train_model_in_new_process
    training_process.start()
  File "F:\Python\envs\tensorflow\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects
Using TensorFlow backend.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I can't extract anything actionable from this error. It clearly points at the line training_process.start(), but I can't see what is failing to pickle, or why.
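The TypeError at least says what cannot be pickled: a module object. That much is easy to reproduce without Keras at all (a minimal check, with math standing in for any module):

import pickle
import math  # stands in for any module object

pickle.dumps(math)  # TypeError: can't pickle module objects

But I don't immediately see which of my arguments is (or references) a module.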

Any help with training these models, either in a for loop or using Process, is appreciated.


Solution

  • Apparently, multiprocessing can't pickle module objects, or more precisely: everything handed to a Process must be picklable, and a module, including one returned by importlib, is not. I was loading models from numbered .py files using importlib:

    model_module = importlib.import_module(model_file)
    

    and hence the trouble.

    Doing the same import inside the Process instead made it all work fine (a sketch of the arrangement is after this answer) :)

    But I still could NOT find a way to do this with a plain for loop, without processes. If you have an answer, please post it here; you're welcome to. Anyway, I'm sticking with processes, because they are cleaner in the sense that each one is isolated, and all the memory allocated for a given training run is reclaimed by the OS when its process exits.
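For reference, a minimal sketch of the working setup. The worker function, the "model_%d" file naming, and the surrounding loop are illustrative stand-ins for my code rather than a drop-in recipe; the if __name__ == '__main__' guard is required on Windows, where child processes are spawned and the main module gets re-imported.

import importlib
import multiprocessing

def worker(model_file, x, y, x_test, y_test, epochs, model_file_count):
    # runs in the child process: the module object is created here,
    # so it never has to be pickled and sent across the process boundary
    model_module = importlib.import_module(model_file)
    model_module.train_model(x, y, x_test, y_test, epochs, model_file_count)

def train_model_in_new_process(model_file, x, y, x_test, y_test, epochs, model_file_count):
    # pass only picklable things (strings, arrays, ints) to the child
    p = multiprocessing.Process(target=worker, args=(model_file, x, y, x_test, y_test, epochs, model_file_count))
    p.start()
    p.join()  # wait, so only one model occupies the GPU at a time

if __name__ == '__main__':  # required with the Windows spawn start method
    # x, y, x_test, y_test, epochs come from your own data loading
    for i in range(0, max_count):
        train_model_in_new_process("model_%d" % i, x, y, x_test, y_test, epochs, i)

Since each child gets a fresh CUDA context, all GPU memory is returned to the driver by the time p.join() returns, so the next model starts with a clean slate.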