I'm getting an InvalidArgumentError on my embedding layer:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
GatherV2: GPU CPU
Cast: GPU CPU
Const: GPU CPU
ResourceSparseApplyAdagradV2: CPU
_Arg: GPU CPU
ReadVariableOp: GPU CPU
Colocation members, user-requested devices, and framework assigned devices, if any:
model_6_user_embedding_embedding_lookup_readvariableop_resource (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
adagrad_adagrad_update_1_update_0_resourcesparseapplyadagradv2_accum (_Arg) framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
model_6/User-Embedding/embedding_lookup/ReadVariableOp (ReadVariableOp)
model_6/User-Embedding/embedding_lookup/axis (Const)
model_6/User-Embedding/embedding_lookup (GatherV2)
gradient_tape/model_6/User-Embedding/embedding_lookup/Shape (Const)
gradient_tape/model_6/User-Embedding/embedding_lookup/Cast (Cast)
Adagrad/Adagrad/update_1/update_0/ResourceSparseApplyAdagradV2 (ResourceSparseApplyAdagradV2) /job:localhost/replica:0/task:0/device:GPU:0
[[{{node model_6/User-Embedding/embedding_lookup/ReadVariableOp}}]] [Op:__inference_train_function_2997]
Link to google colab: https://colab.research.google.com/drive/1ZN1HzSTTfvA_zstuI-EsKjw7Max1f73v?usp=sharing
It's a really simple neural network, and the data is available to download from Kaggle; you can just drag and drop the files into Colab to reproduce it.
I've also tried enabling soft device placement:
tf.config.set_soft_device_placement(True)
but that doesn't seem to have helped.
From the error log, it looks like MirroredStrategy has assigned the embedding lookup and its sparse Adagrad update (ResourceSparseApplyAdagradV2, which only has a CPU kernel, as the supported-devices list shows) to the GPU. I was hoping tf.config.set_soft_device_placement(True) would ask TensorFlow to fall back to the CPU for that op, but it feels like that's being ignored.
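For reference, here's roughly the order I'm calling things in (the model here is a simplified stand-in, with made-up layer sizes, not the real notebook model):

```python
import tensorflow as tf

# Ask TF to fall back to CPU when an op has no GPU kernel
tf.config.set_soft_device_placement(True)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Simplified stand-in for the real model
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Embedding(input_dim=1000, output_dim=32,
                                  name="User-Embedding"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1),
    ])
    # Adagrad is what triggers the CPU-only ResourceSparseApplyAdagradV2 op
    model.compile(optimizer=tf.keras.optimizers.Adagrad(), loss="mse")
```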
Has anyone seen this problem before and know of a workaround?
Found a similar issue for TF 1.14: https://github.com/tensorflow/tensorflow/issues/31318
Looks like MirroredStrategy can't support training embedding layers with optimisers that keep per-slot accumulators (momentum in that issue, Adagrad's accumulator here).
Cloning the above notebook and using RMSprop (with momentum=0) seemed to work: https://colab.research.google.com/drive/13MXa8Q96M6uzlkK3K_M7vmQfclL59eRj?usp=sharing
I'll use RMSprop with no momentum for now until this issue is fixed. The error message certainly hasn't helped!
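A minimal sketch of the workaround (again with placeholder layer sizes, not the real notebook model):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Simplified stand-in for the real model
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Embedding(input_dim=1000, output_dim=32),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1),
    ])
    # Swapping Adagrad for RMSprop with momentum=0.0 avoided the
    # CPU-only ResourceSparseApplyAdagradV2 kernel in my case
    model.compile(optimizer=tf.keras.optimizers.RMSprop(momentum=0.0),
                  loss="mse")
```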