Tags: gpu, cpu, gpgpu, cpu-cache, xla

XLA on CPU -- where do the gains come from?


I understand that XLA performs automatic kernel fusion for a computational graph, which comes in handy for reducing memory-bandwidth usage on a GPU. What gains can one derive from using XLA on a CPU? Is it the same principle, fusing computations and not writing intermediate results to the L1 cache? I would appreciate a layman's explanation.


Solution

  • Yes, basically it's what you said.

    In general, the more information (or "context") you, as a compiler, have about a set of computations, the better you can optimize them.

    As pointed out on the XLA page, the single most important feature of XLA is fusion.
    Instead of computing x + y*z as two separate operations, it can be computed as a single fused multiply-add (FMA) operation.
    This is generally faster, and it also avoids an intermediate result that would otherwise have to be rounded (losing precision) and stored somewhere (see the first sketch below).

    Without XLA, TensorFlow probably works by reading a set of data from memory, running one of a predefined set of kernels on it, and storing each partial result back in memory so that the next kernel can consume it.
    With XLA, linear-algebra patterns are recognized and optimized further by combining one or more kernels, avoiding unnecessary round trips to memory (the second sketch below shows how this is enabled).

    Modern mainstream CPUs support "vector" instructions (SIMD, in jargon), and some support linear-algebra operations much as GPUs do.
    So yes, it's the same principle, although GPUs can run far more linear-algebra operations in parallel, so the gain is bigger there (the third sketch below illustrates the effect on a CPU).
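
    To make the fusion point concrete, here is a minimal sketch in JAX (which compiles through XLA on both CPU and GPU). The function name fma_like and the array shapes are illustrative assumptions, not anything from the question:

    ```python
    import jax
    import jax.numpy as jnp

    def fma_like(x, y, z):
        # Computed op by op, y * z would need a temporary buffer before the add;
        # under jax.jit, XLA can fuse the whole expression into one loop and use
        # FMA instructions where the CPU supports them.
        return x + y * z

    fused = jax.jit(fma_like)   # hand the whole expression to XLA

    x = jnp.full((1024, 1024), 2.0)
    y = jnp.full((1024, 1024), 3.0)
    z = jnp.full((1024, 1024), 4.0)
    print(fused(x, y, z)[0, 0])  # 14.0
    ```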
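
    For the kernel-by-kernel point, TensorFlow itself lets you opt a function into XLA compilation with jit_compile=True on tf.function (available in recent TF 2.x releases). This is a sketch under that assumption, with made-up shapes:

    ```python
    import tensorflow as tf

    @tf.function(jit_compile=True)   # ask TensorFlow to compile this graph with XLA
    def fused_chain(x, y, z):
        # Without XLA each of these ops is a separate kernel that writes its
        # result back to memory; XLA can fuse them into one pass over the data.
        return tf.nn.relu(x + y * z)

    x = tf.random.normal((1024, 1024))
    y = tf.random.normal((1024, 1024))
    z = tf.random.normal((1024, 1024))
    print(fused_chain(x, y, z).shape)   # (1024, 1024)
    ```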
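
    And for the CPU/SIMD point, a rough way to see the effect yourself is to time a small elementwise chain with and without jax.jit on the CPU backend. Whether (and how much) the fused version wins depends on your hardware and the array size, so treat this as an experiment to run, not a promised speedup:

    ```python
    import time
    import jax
    import jax.numpy as jnp

    def chain(x):
        # Four elementwise ops: run op by op, each one streams the whole array
        # through memory; fused, XLA emits a single vectorized (SIMD) loop.
        return jnp.tanh(x * 2.0 + 1.0) - 0.5

    chain_jit = jax.jit(chain)
    x = jnp.ones((4000, 4000))

    chain_jit(x).block_until_ready()            # warm-up: compile once
    t0 = time.perf_counter()
    chain_jit(x).block_until_ready()
    print(f"fused:   {time.perf_counter() - t0:.4f} s")

    t0 = time.perf_counter()
    chain(x).block_until_ready()                # op-by-op dispatch, no fusion across ops
    print(f"unfused: {time.perf_counter() - t0:.4f} s")
    ```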