In my CUDA application, I am copying data from device memory to shared memory. Is that data cached in L1 as well?
By default, all loads from global memory are cached in L1. The destination of the load has no effect on L1 caching, whether that destination is a register, shared memory, or thread-local memory. Shared memory itself is, of course, not cached.
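A minimal sketch (not from the original post) of the pattern in question: a kernel that stages a tile of global memory into shared memory. The line marked as the global load is the access that is serviced through the cache hierarchy; the later reads hit the shared-memory tile directly and never touch L1. The kernel name, tile size, and the assumption of a 256-thread block are illustrative only.

```cuda
// Hypothetical example: stage a tile of global memory into shared memory.
// Assumes the kernel is launched with blockDim.x == 256.
__global__ void stageToShared(const float *in, float *out, int n)
{
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n)
        tile[threadIdx.x] = in[idx];   // global load: goes through L1 (and L2) on the way in

    __syncthreads();                   // make the tile visible to the whole block

    if (idx < n)
        out[idx] = tile[threadIdx.x] * 2.0f;  // reads come from shared memory, not from L1
}
```

In other words, the copy itself is an ordinary global load followed by a shared-memory store; only the global-load half of that pair interacts with the cache.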