The BatchNormalization layer of my Keras model (TensorFlow backend) does not work and raises an InternalError exception at training time.
Here is the line defining the BatchNormalization layer in my model:
bn = BatchNormalization(axis=3)(grid)
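For context, grid is a 4D channels_last tensor whose per-sample shape is (25, 25, 600) (see the debug output below), so axis=3 should normalize over the last (channel) axis. Here is a minimal standalone sketch of that wiring, with a made-up Input standing in for the real grid tensor; it is not my actual model:

# Minimal sketch, not the real model: a made-up Input stands in for grid.
import numpy as np
from keras.layers import Input, BatchNormalization
from keras.models import Model

inp = Input(shape=(25, 25, 600))      # channels_last: (rows, cols, channels)
bn = BatchNormalization(axis=3)(inp)  # axis=3 -> normalize over the 600 channels
model = Model(inputs=inp, outputs=bn)

out = model.predict(np.random.rand(2, 25, 25, 600), batch_size=1)
print(out.shape)                      # (2, 25, 25, 600)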
To debug, I created two models: one whose output is the tensor just before the BatchNormalization layer (grid), and one whose output is the BatchNormalization layer itself (bn):
debug = Model(inputs=[question1, question2], outputs=grid)    # output: the tensor just before BatchNormalization
debug.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
bn = BatchNormalization(axis=3)(grid)
debug2 = Model(inputs=[question1, question2], outputs=bn)     # output: the BatchNormalization layer itself
debug2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Then I ran a prediction on some random data, just to get any output at all:
pred = debug.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
print(pred[0].shape)
pred = debug2.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
print(pred[0].shape)
And the result is:
(2, 25)
2/2 [==============================] - 2s 1s/step
(25, 25, 600)
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1291 try:
-> 1292 return fn(*args)
1293 except errors.OpError as e:
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1276 return self._call_tf_sessionrun(
-> 1277 options, feed_dict, fetch_list, target_list, run_metadata)
1278
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1366 self._session, options, feed_dict, fetch_list, target_list,
-> 1367 run_metadata)
1368
InternalError: cuDNN launch failure : input shape ([1,600,25,25])
[[{{node batch_normalization_1/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, batch_normalization_1/gamma/read, batch_normalization_1/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
[[{{node batch_normalization_1/cond/Merge/_949}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_133_batch_normalization_1/cond/Merge", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
InternalError Traceback (most recent call last)
<ipython-input-11-748dc132eac2> in <module>()
4 pred = debug.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
5 print(pred[0].shape)
----> 6 pred = debug2.predict([Q1_test_debug, Q2_test_debug], verbose=1, batch_size=1)
7 print(pred[0].shape)
~/.local/lib/python3.5/site-packages/keras/engine/training.py in predict(self, x, batch_size, verbose, steps)
1833 f = self.predict_function
1834 return self._predict_loop(f, ins, batch_size=batch_size,
-> 1835 verbose=verbose, steps=steps)
1836
1837 def train_on_batch(self, x, y,
~/.local/lib/python3.5/site-packages/keras/engine/training.py in _predict_loop(self, f, ins, batch_size, verbose, steps)
1329 ins_batch[i] = ins_batch[i].toarray()
1330
-> 1331 batch_outs = f(ins_batch)
1332 if not isinstance(batch_outs, list):
1333 batch_outs = [batch_outs]
~/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2480 session = get_session()
2481 updated = session.run(fetches=fetches, feed_dict=feed_dict,
-> 2482 **self.session_kwargs)
2483 return updated[:len(self.outputs)]
2484
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
885 try:
886 result = self._run(None, fetches, feed_dict, options_ptr,
--> 887 run_metadata_ptr)
888 if run_metadata:
889 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1108 if final_fetches or final_targets or (handle and feed_dict_tensor):
1109 results = self._do_run(handle, final_targets, final_fetches,
-> 1110 feed_dict_tensor, options, run_metadata)
1111 else:
1112 results = []
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1284 if handle is None:
1285 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1286 run_metadata)
1287 else:
1288 return self._do_call(_prun_fn, handle, feeds, fetches)
~/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1306 self._config.experimental.client_handles_error_formatting):
1307 message = error_interpolation.interpolate(message, self._graph)
-> 1308 raise type(e)(node_def, op, message)
1309
1310 def _extend_graph(self):
InternalError: cuDNN launch failure : input shape ([1,600,25,25])
[[{{node batch_normalization_1/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, batch_normalization_1/gamma/read, batch_normalization_1/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
[[{{node batch_normalization_1/cond/Merge/_949}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_133_batch_normalization_1/cond/Merge", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'batch_normalization_1/FusedBatchNorm', defined at:
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/home/remondn/.local/lib/python3.5/site-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelapp.py", line 497, in start
self.io_loop.start()
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/platform/asyncio.py", line 132, in start
self.asyncio_loop.run_forever()
File "/usr/lib/python3.5/asyncio/base_events.py", line 345, in run_forever
self._run_once()
File "/usr/lib/python3.5/asyncio/base_events.py", line 1312, in _run_once
handle._run()
File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
self._callback(*self._args)
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/platform/asyncio.py", line 122, in _handle_events
handler_func(fileobj, events)
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/home/remondn/.local/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/home/remondn/.local/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
handler(stream, idents, msg)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/home/remondn/.local/lib/python3.5/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
raw_cell, store_history, silent, shell_futures)
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2901, in run_ast_nodes
if self.run_code(code, result):
File "/home/remondn/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-44a967130b40>", line 87, in <module>
bn = BatchNormalization(axis=3)(grid)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/engine/topology.py", line 619, in __call__
output = self.call(inputs, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/layers/normalization.py", line 181, in call
epsilon=self.epsilon)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1831, in normalize_batch_in_training
epsilon=epsilon)
File "/home/remondn/.local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1806, in _fused_normalize_batch_in_training
data_format=tf_data_format)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/ops/nn_impl.py", line 909, in fused_batch_norm
name=name)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 3466, in _fused_batch_norm
is_training=is_training, name=name)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def)
File "/home/remondn/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
InternalError (see above for traceback): cuDNN launch failure : input shape ([1,600,25,25])
[[{{node batch_normalization_1/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_1/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, batch_normalization_1/gamma/read, batch_normalization_1/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
[[{{node batch_normalization_1/cond/Merge/_949}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_133_batch_normalization_1/cond/Merge", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
A few things that I don't understand:

- According to the shape printed by the first debug model ((25, 25, 600)), the format of the output of the previous layer / the input of BatchNormalization is channels_last. But the error reports input shape ([1,600,25,25]), which is channels_first. Why did it suddenly change?
- In my BatchNormalization layer I set axis=3, but the error shows FusedBatchNorm [...] data_format="NCHW", indicating a channels_first format. No matter which axis I choose (I tried 1, 2, 0, -1), I always get this error with this data_format. Why is it not changing when I change the axis of BatchNormalization?

Does anyone have any idea how to fix this?
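For what it's worth, the node name in the traceback (batch_normalization_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer) suggests the NCHW layout is inserted by TensorFlow's GPU layout optimizer rather than by the axis argument. A quick sanity check of the Keras-side configuration (just a sketch, not part of my model):

# Sanity check: confirm the Keras backend data format and the layer's axis setting.
import keras.backend as K
from keras.layers import BatchNormalization

print(K.image_data_format())             # expected: 'channels_last'
print(BatchNormalization(axis=3).axis)   # 3, i.e. the last axis of a 4D NHWC tensor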
It turns out the versions of the libraries I was using were mismatched.
I don't know why, but everything else was working (in fact, removing the BatchNormalization layer gave a working network...).
Anyway, I updated my packages to use CUDA 9.0 with cuDNN 7.0.5 and tensorflow-gpu 1.10.0.
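In case it helps someone, a quick way to confirm what the environment actually picked up after the upgrade (these are standard TensorFlow/Keras calls; the shell commands in the comments are the usual system-side checks):

# Environment check after the upgrade (standard TF/Keras APIs).
import tensorflow as tf
import keras

print(tf.__version__)                # expect 1.10.0
print(keras.__version__)
print(tf.test.is_built_with_cuda())  # True for a tensorflow-gpu build
print(tf.test.is_gpu_available())    # True only if the CUDA/cuDNN runtime loads correctly

# System side (run in a shell):
#   nvcc --version                                            -> CUDA toolkit version (expect 9.0)
#   grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn.h     -> cuDNN version (expect 7.0.5)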
The links I used to find matching versions of all of these: