tensorflow tensorflow2.0 ctc beam-search

Does `tf.nn.ctc_beam_search_decoder()` not support GPU in TensorFlow 2?

I am trying to run tf.nn.ctc_beam_search_decoder() on a GPU.
The problem is that it does not use the GPU.

I confirmed that other TensorFlow ops (e.g. Reshape and SigmoidGrad) run on the GPU.
But some ops, including ctc_beam_search_decoder(), only run on the CPU, and ctc_beam_search_decoder() is slow.

So I have two questions.
First, does ctc_beam_search_decoder() not support GPU in TensorFlow 2?
Second, if it is supported, how do I enable it, or which function (or method) should I use instead?

Here is a simple example.

Program code:

import tensorflow as tf
from tensorflow.python.client import device_lib

tf.debugging.set_log_device_placement(True)
print(device_lib.list_local_devices())

inputs = tf.convert_to_tensor([
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.2, 0.0, 0.3, 0.1, 0.1],
    [0.2, 0.21, 0.3, 0.4, 0.1],
    [0.2, 0.0, 0.6, 0.1, 0.5],
    [0.2, 1.2, 0.3, 2.1, 0.1]])

inputs = tf.expand_dims(inputs, axis=1)
inputs_len = tf.convert_to_tensor([5])

decoded, _ = tf.nn.ctc_beam_search_decoder(inputs, inputs_len)

Result (standard output):

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 714951449022474384
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 11733532016050292601
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 394441871956590417
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11150726272
locality {
  bus_id: 1
  links {
  }
}
incarnation: 5917663253173554940
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
]
Executing op ExpandDims in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op CTCBeamSearchDecoder in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op StridedSlice in device /job:localhost/replica:0/task:0/device:GPU:0

Ignore the input and output data and focus on the device being used.
In this case, ExpandDims and StridedSlice were executed on the GPU, but CTCBeamSearchDecoder was not.

Solution

The beam search decoder is implemented in plain C++, so it runs on the CPU and not on the GPU (see the code here [1]; it is basically the same as in TF1).

Beam search is an iterative algorithm (it goes from one time step to the next, and each step builds on the beams kept at the previous one), so I don't think running it on the GPU would give much of a performance improvement. The simplest way to improve runtime is to tune the beam width: a smaller width is faster, a larger one is more accurate.
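To make the time-step dependence and the role of the beam width concrete, here is a minimal pure-Python sketch of CTC prefix beam search. It is not TensorFlow's C++ implementation, only a simplified illustration of the same idea. The names `ctc_prefix_beam_search`, `log_softmax` and `logsumexp` are hypothetical helpers written for this example, and the blank label is assumed to be the last class.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over a few values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_softmax(row):
    """Turn one time step of logits into log-probabilities."""
    z = logsumexp(*row)
    return [x - z for x in row]


def ctc_prefix_beam_search(log_probs, beam_width, blank):
    """Hypothetical helper: simplified CTC prefix beam search.

    log_probs: list of T time steps, each a list of C log-probabilities.
    Returns (best label sequence, its log-probability).
    """
    # prefix -> (log-prob of paths ending in blank, ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for step in log_probs:  # sequential: each step needs the previous beams
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for c, lp in enumerate(step):
                if c == blank:
                    # A blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                elif prefix and prefix[-1] == c:
                    # Repeated label without a blank collapses into the prefix.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, logsumexp(nb, p_nb + lp))
                    # After a blank, the repeat starts a new label.
                    ext = prefix + (c,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, logsumexp(nb, p_b + lp))
                else:
                    # A different label always extends the prefix.
                    ext = prefix + (c,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        # Pruning: the work at the next step is proportional to beam_width.
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best, scores = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return list(best), logsumexp(*scores)


# The logits from the question, with the last class (index 4) assumed blank.
logits = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.2, 0.0, 0.3, 0.1, 0.1],
    [0.2, 0.21, 0.3, 0.4, 0.1],
    [0.2, 0.0, 0.6, 0.1, 0.5],
    [0.2, 1.2, 0.3, 2.1, 0.1],
]
log_probs = [log_softmax(row) for row in logits]
for width in (1, 4, 16):
    labels, score = ctc_prefix_beam_search(log_probs, width, blank=4)
    print(width, labels, round(score, 4))
```

The outer loop over time steps cannot be reordered, because the candidates at step t are extensions of the beams that survived step t-1, while the inner work per step grows with the beam width. In TensorFlow the corresponding knob is the `beam_width` argument of tf.nn.ctc_beam_search_decoder(), which defaults to 100.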

[1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/ctc/ctc_beam_search.h#L159