tensorflow · keras · gpu · embedding

How do I train a deep learning neural network that contains an embedding layer using a GPU?


I'm getting an InvalidArgumentError on my embedding layer:

Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
GatherV2: GPU CPU 
Cast: GPU CPU 
Const: GPU CPU 
ResourceSparseApplyAdagradV2: CPU 
_Arg: GPU CPU 
ReadVariableOp: GPU CPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  model_6_user_embedding_embedding_lookup_readvariableop_resource (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  adagrad_adagrad_update_1_update_0_resourcesparseapplyadagradv2_accum (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  model_6/User-Embedding/embedding_lookup/ReadVariableOp (ReadVariableOp) 
  model_6/User-Embedding/embedding_lookup/axis (Const) 
  model_6/User-Embedding/embedding_lookup (GatherV2) 
  gradient_tape/model_6/User-Embedding/embedding_lookup/Shape (Const) 
  gradient_tape/model_6/User-Embedding/embedding_lookup/Cast (Cast) 
  Adagrad/Adagrad/update_1/update_0/ResourceSparseApplyAdagradV2 (ResourceSparseApplyAdagradV2) /job:localhost/replica:0/task:0/device:GPU:0

     [[{{node model_6/User-Embedding/embedding_lookup/ReadVariableOp}}]] [Op:__inference_train_function_2997]

Link to google colab: https://colab.research.google.com/drive/1ZN1HzSTTfvA_zstuI-EsKjw7Max1f73v?usp=sharing

It's a really simple neural network, and the data is available to download from Kaggle - you could just drag and drop it into Colab to get it working.

I've also tried enabling soft device placement with tf.config.set_soft_device_placement(True), but that doesn't seem to have worked.

From the error log, it looks like MirroredStrategy has assigned the whole colocation group around the embedding lookup to the GPU, even though the Adagrad sparse update op in that group (ResourceSparseApplyAdagradV2) only supports CPU - and I can see why that conflict arises. I was hoping that tf.config.set_soft_device_placement(True) would make TensorFlow fall back to the CPU for that op, but it feels like that's ignored.
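For reference, here is a minimal sketch of the placement experiment described above. The layer name mirrors the error log, but the sizes and model shape are made up, not taken from the notebook. As an alternative to relying on soft placement, the embedding can be pinned to the CPU explicitly; whether that actually resolves the colocation conflict under MirroredStrategy depends on the TensorFlow version.

```python
import tensorflow as tf

# Ask TF to fall back to a supported device when an op can't run
# where it was requested (the behaviour hoped for above).
tf.config.set_soft_device_placement(True)

# Hypothetical sizes - the original notebook's values are not shown here.
num_users, embedding_dim = 1000, 8

user_id = tf.keras.Input(shape=(1,), name="user_id")

# Explicitly placing the embedding on CPU keeps the CPU-only sparse
# optimizer update colocated with its variable.
with tf.device("/CPU:0"):
    user_emb = tf.keras.layers.Embedding(
        num_users, embedding_dim, name="User-Embedding")(user_id)

output = tf.keras.layers.Dense(1)(tf.keras.layers.Flatten()(user_emb))
model = tf.keras.Model(user_id, output)
```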

Has anyone seen this problem before and know of a workaround?


Solution

  • Found a similar issue for TF1.14: https://github.com/tensorflow/tensorflow/issues/31318

Looks like MirroredStrategy can't support training embedding layers with certain optimisers: the sparse update op in the log (ResourceSparseApplyAdagradV2) only supports CPU, so it can't be colocated with the GPU-placed embedding variable.

    Cloning the above notebook and using RMSprop (with momentum=0) seemed to work: https://colab.research.google.com/drive/13MXa8Q96M6uzlkK3K_M7vmQfclL59eRj?usp=sharing

I'll use RMSprop with no momentum for now until this issue is fixed. The error message certainly hasn't helped!