GPGPU - Arithmetic intensity and caches

I am studying the theory behind GPUs used for scientific applications, and I found this sentence:

High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches.

What exactly does this mean? Can it be interpreted as a suggestion to avoid storing precomputed results when programming for a GPU, and to recompute them every time we run a function on the device instead?

E.g., suppose we have code that runs a recursive loop to compute a long array, with tons of calculations in it. Suppose also that we could precompute some partial arrays that would let the loop skip some computations, even ones that are not very expensive. According to the quote, should we avoid this and recompute these arrays on every iteration instead?

Solution

GPUs have access to different types of memory. The type you use to hand input data to the GPU, and to retrieve results from it once it is done computing, is global memory (for example, a standard GTX 480 has 1.5 GB of it).

This memory has high bandwidth, but also high latency (around 400-800 cycles on a GTX 480). So instead of precomputing values, storing them in global memory, and then fetching them again (and paying that latency on every fetch), you are often better off computing them on the GPU. That way, you do not have to wait for memory to deliver the precomputed data.

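If the threads of a warp (the group of threads that execute together in lockstep) are waiting on a global memory fetch, the warp stalls: these threads cannot advance because the data has not arrived. The GPU hides this latency by switching to other warps that have work ready, which is why the quote asks for both many data elements and high arithmetic intensity. GPUs can calculate quite a lot in 400-800 cycles, so it is often better to exchange memory fetches for computation.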
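To make the trade-off concrete, here is a minimal CUDA sketch that computes the same result two ways: once by reading a precomputed factor from a table in global memory, and once by recomputing the factor inside the kernel. The function `weight`, the two kernels, and the host helper `max_diff_table_vs_recompute` are hypothetical names made up for this illustration, and `weight` is just a stand-in for some cheap value you might be tempted to precompute. The sketch only shows that both variants agree; which one is faster depends on how expensive the real computation is and on your hardware, so measure with a profiler.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical cheap factor, standing in for a value one might precompute.
__host__ __device__ inline float weight(int i) {
    return 1.0f / (1.0f + 0.5f * static_cast<float>(i));
}

// Variant A: fetch the precomputed factor from a table in global memory.
__global__ void scale_with_table(const float* x, const float* table, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * table[i];
}

// Variant B: recompute the factor in registers, saving one global load per element.
__global__ void scale_recompute(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * weight(i);
}

// Hypothetical host helper: runs both variants and returns the largest
// absolute difference between their outputs, or -1.0f if a CUDA call fails.
float max_diff_table_vs_recompute(int n) {
    assert(n > 0);
    std::vector<float> hx(n), htable(n), ya(n), yb(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = static_cast<float>(i % 7);
        htable[i] = weight(i);  // the "precomputed" table, built on the host
    }

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dx = nullptr, *dtable = nullptr, *dya = nullptr, *dyb = nullptr;
    bool ok = cudaMalloc(&dx, bytes) == cudaSuccess &&
              cudaMalloc(&dtable, bytes) == cudaSuccess &&
              cudaMalloc(&dya, bytes) == cudaSuccess &&
              cudaMalloc(&dyb, bytes) == cudaSuccess &&
              cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dtable, htable.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        scale_with_table<<<blocks, threads>>>(dx, dtable, dya, n);
        scale_recompute<<<blocks, threads>>>(dx, dyb, n);
        // The blocking device-to-host copies also wait for both kernels to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(ya.data(), dya, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(yb.data(), dyb, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dx); cudaFree(dtable); cudaFree(dya); cudaFree(dyb);
    if (!ok) return -1.0f;

    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i) max_diff = std::max(max_diff, std::fabs(ya[i] - yb[i]));
    return max_diff;
}
```

Calling `max_diff_table_vs_recompute(1 << 16)` from a `main` function and checking that the result is non-negative and close to zero confirms the two variants are interchangeable. Variant B trades one global memory load per element for a few arithmetic instructions, which is the exchange described above.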

That being said, you can use the other types of memory that are available to you. For example, in CUDA, you have access to on-chip memory (shared memory), which is very fast and has very low latency. You can have one thread in a block calculate something, store it in shared memory, synchronize, and have the other threads in the block use that value. So you move the precalculation to the GPU, and use on-chip memory to retrieve the precomputed values.
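The shared-memory idea can be sketched as follows. Note that in CUDA, shared memory is visible to all threads of a thread block, so this sketch has thread 0 of each block compute a value, uses `__syncthreads()` as a barrier, and then lets every thread in the block read it. The names `block_factor`, `scale_by_block_factor`, and `shared_factor_max_error` are hypothetical, and the per-block factor is an arbitrary stand-in for whatever intermediate result your own code would share.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical per-block value that is moderately costly to compute.
__host__ __device__ inline float block_factor(int b) {
    float s = 0.0f;
    for (int k = 1; k <= 16; ++k) s += 1.0f / static_cast<float>(b + k);
    return s;
}

__global__ void scale_by_block_factor(const float* x, float* y, int n) {
    __shared__ float factor;  // one copy per thread block, held in on-chip memory

    // A single thread computes the value once for the whole block.
    if (threadIdx.x == 0) factor = block_factor(blockIdx.x);
    __syncthreads();  // barrier: all threads wait until the value is written

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * factor;  // every thread reads the shared value
}

// Hypothetical host helper: runs the kernel and returns the largest absolute
// deviation from a host-side reference, or -1.0f if a CUDA call fails.
float shared_factor_max_error(int n) {
    assert(n > 0);
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    std::vector<float> hx(n), hy(n);
    for (int i = 0; i < n; ++i) hx[i] = static_cast<float>(i % 5);

    float *dx = nullptr, *dy = nullptr;
    bool ok = cudaMalloc(&dx, bytes) == cudaSuccess &&
              cudaMalloc(&dy, bytes) == cudaSuccess &&
              cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scale_by_block_factor<<<blocks, threads>>>(dx, dy, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dx); cudaFree(dy);
    if (!ok) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float expected = hx[i] * block_factor(i / threads);
        max_err = std::max(max_err, std::fabs(hy[i] - expected));
    }
    return max_err;
}
```

Here the factor is computed once per block instead of once per thread, and no thread has to fetch it from global memory. The barrier matters: without `__syncthreads()`, threads in other warps of the same block could read `factor` before thread 0 has written it.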