Recently I looked into reinforcement learning, and there was one question bugging me that I could not find an answer to: how is training done efficiently using GPUs? To my understanding, constant interaction with an environment is required, which seems like a huge bottleneck, since this task is often non-mathematical and hard to parallelize. Yet AlphaGo, for example, uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, and these are often better off running on the CPU than on the GPU. So, if your code for selecting actions and your code for running an update / learning step are both very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to move them onto the GPU.
However, when you have a big neural network that you need to run through whenever you select an action or perform a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running it on a GPU instead of a CPU is often large enough to be worth the effort, even if it means you're quite regularly "switching" between CPU and GPU and may need to copy some data from RAM to VRAM or the other way around. It is also relatively common practice to batch up states from multiple different episodes that are running in parallel (either truly in parallel with multiple CPU threads, or even just sequentially taking one time step in each episode), and have a GPU (or TPU) process them all at once. This works because there is usually enough VRAM to fit the forward and backward passes for multiple different states on the GPU at the same time.
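The batching idea above can be sketched in a few lines. This is a toy illustration, not any particular library's API: the "policy network" is just a single hypothetical linear layer, and plain numpy stands in for a GPU framework. The point is the shape of the computation: states from several episodes are stacked into one batch, and a single forward pass produces an action for every episode at once.

```python
import numpy as np

# Hypothetical policy "network": one linear layer mapping 4 state
# features to 2 action logits (stand-in for a real neural network).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))

def policy_batch(states):
    # One forward pass over the whole batch of states at once; on a GPU
    # framework, this batched call is where the speedup comes from,
    # compared to calling the network once per episode.
    return states @ W

# Current states of 8 episodes running "in parallel", one row each.
states = rng.standard_normal((8, 4))

logits = policy_batch(states)    # shape (8, 2): one row of logits per episode
actions = logits.argmax(axis=1)  # greedy action for each of the 8 episodes
print(logits.shape, actions.shape)
```

In a real deep RL setup the same pattern applies to learning steps too: transitions collected from many episodes are stacked into one batch before the backward pass.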
More recently*, there is also a trend of having the logic of the environment itself run as much as possible in parallel on devices such as GPUs or TPUs. Examples include Google's brax for physics simulation, and JAX-LOB for RL-based trading on financial exchanges.
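The key trick these libraries use is expressing the environment step itself as array operations over all environment instances at once, so the simulation can live on the accelerator alongside the network. A minimal sketch, using a made-up 1-D point-mass environment and plain numpy in place of JAX (the environment, its dynamics, and the reward are all invented for illustration):

```python
import numpy as np

# Toy vectorized environment: N independent 1-D point masses, all
# stepped with a single set of array operations (no Python loop over
# individual environments). This mirrors, in spirit, how brax-style
# simulators keep environment logic on the GPU/TPU.
N = 1024
pos = np.zeros(N)  # position of each of the N environments
vel = np.zeros(N)  # velocity of each of the N environments

def step_all(pos, vel, actions, dt=0.1):
    # Advance every environment simultaneously.
    vel = vel + actions * dt
    pos = pos + vel * dt
    reward = -np.abs(pos)  # toy reward: stay close to the origin
    return pos, vel, reward

actions = np.ones(N)  # one action per environment, e.g. from a batched policy
pos, vel, reward = step_all(pos, vel, actions)
print(pos.shape, reward.shape)  # (1024,) (1024,)
```

With the environment written this way, there is no CPU/GPU "switching" at all: action selection, the environment step, and the learning update can all stay on the accelerator.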
*Written as an edit in 2023; the original answer is from 2018.