When training my transformer model on a TPU, I get the following error:
UnavailableError: 2 root error(s) found.
(0) Unavailable: Socket closed
(1) Invalid argument: Unable to find a context_id matching the specified one (13089686768223941123). Perhaps the worker was restarted, or the context was GC'd?
My data is divided into buckets by sequence length to get the best performance:

- length less than or equal to 8
- from 9 to 16
- from 17 to 24

Each batch is loaded from a randomly chosen bucket.
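For context, here is a minimal pure-Python sketch of the bucketing scheme described above. The function names (`bucket_id`, `make_buckets`, `sample_batch`) and the zero-padding detail are hypothetical illustrations, not my actual pipeline (which uses TensorFlow), but the grouping logic is the same:

```python
import random

# Bucket upper bounds matching the post: <=8, 9-16, 17-24.
BUCKET_BOUNDS = [8, 16, 24]

def bucket_id(seq_len):
    """Return the index of the first bucket that fits this length."""
    for i, upper in enumerate(BUCKET_BOUNDS):
        if seq_len <= upper:
            return i
    raise ValueError(f"sequence length {seq_len} exceeds largest bucket")

def make_buckets(sequences):
    """Group sequences into buckets keyed by bucket index."""
    buckets = {i: [] for i in range(len(BUCKET_BOUNDS))}
    for seq in sequences:
        buckets[bucket_id(len(seq))].append(seq)
    return buckets

def sample_batch(buckets, batch_size, rng=random):
    """Pick a random non-empty bucket and draw a batch from it,
    padding every sequence to that bucket's upper bound."""
    non_empty = [i for i, b in buckets.items() if b]
    i = rng.choice(non_empty)
    batch = rng.sample(buckets[i], min(batch_size, len(buckets[i])))
    upper = BUCKET_BOUNDS[i]
    return [seq + [0] * (upper - len(seq)) for seq in batch]
```

Because every sequence in a batch is padded to its bucket's upper bound, each bucket produces exactly one input shape, which is why each bucket triggers exactly one retrace.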
The first time a batch is drawn from each bucket, TensorFlow creates a new graph and retraces the model. The error occurs on the third retracing, so there is no error if I train from only two of the buckets.
As far as I understand, this is a bug in TF 2.3: after downgrading to 2.2.0, the error disappeared.