I have multiple models to train in Keras/TensorFlow, one after the other, without manually calling train.py each time, so I did:
```python
for i in range(0, max_count):
    model = get_model(i)  # returns the i-th model
    model.fit(...)
    model.save(...)
```
It runs fine for i=0 (and in fact runs perfectly when run separately). The problem is that when the second model is loaded, I get a ResourceExhaustedError (OOM), so I tried to release memory at the end of the for loop:
```python
del model
keras.backend.clear_session()
tf.keras.backend.clear_session()
tf.reset_default_graph()
gc.collect()
```
none of which works, individually or collectively. I looked it up further and found that the only way to fully release GPU memory is to end the process.
Also, from this Keras issue:

> Update (2018/08/01): Currently only TensorFlow backend supports proper cleaning up of the session. This can be done by calling K.clear_session(). This will remove EVERYTHING from memory (models, optimizer objects and anything that has tensors internally). So there is no way to remove a specific stale model. This is not a bug of Keras but a limitation of the backends.
So clearly the way to go is to create a new process every time I load a model, wait for it to end, and then create another one in a fresh process, like here:
```python
import multiprocessing

def train_model_in_new_process(model_module, kfold_object, x, y, x_test, y_test, epochs, model_file_count):
    training_process = multiprocessing.Process(target=train_model, args=(x, y, x_test, y_test, epochs, model_file_count))
    training_process.start()
    training_process.join()
```
but then it throws this error:
```
  File "train.py", line 110, in train_model_in_new_process
    training_process.start()
  File "F:\Python\envs\tensorflow\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects
Using TensorFlow backend.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
```
I really can't use the information in the error to see what I was doing wrong. It clearly points at the line `training_process.start()`, but I can't seem to understand what's causing the problem.
Any help to train models, either using a `for` loop or using `Process`, is appreciated.
Apparently, multiprocessing doesn't like modules, or, more precisely, importlib modules. I was loading models from numbered `.py` files using importlib:

```python
model_module = importlib.import_module(model_file)
```

and hence the trouble. I did the same import inside the `Process` instead, and it was all fine :)
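The root cause is reproducible without TensorFlow at all: module objects simply are not picklable, and `multiprocessing` (with the spawn start method used on Windows) pickles everything it sends to the child process. A minimal demonstration, using the standard-library `json` module as a stand-in for my numbered model modules:

```python
import importlib
import pickle

# A module object, like the ones returned by importlib.import_module()
mod = importlib.import_module("json")

try:
    pickle.dumps(mod)
    outcome = "pickled fine"
except TypeError as exc:
    # This is the same TypeError that surfaced in the traceback above
    outcome = f"failed: {exc}"

print(outcome)
```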
But I still could NOT find a way to do this without `Process`es, using plain `for` loops. If you have an answer, please post it here; you're welcome. But anyway, I'm continuing with processes, because processes are, I believe, cleaner in the sense that they are isolated, and all the memory allocated for a specific one is released when it's done.