Tags: caching, cuda, nvidia, gpu-shared-memory

Does CUDA cache data from global memory in the unified cache before storing it in shared memory?


As far as I know, previous NVIDIA GPU architectures followed the path global memory -> L2 -> L1 -> register -> shared memory when storing data into shared memory.

However, the Maxwell GPU (GTX 980) has physically separate unified cache and shared memory. Does this architecture follow the same steps to store data into shared memory, or does it support direct transfers between global and shared memory?
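
For concreteness, the usual way to do this staging in code is a plain assignment from a global pointer into a __shared__ array; the compiler turns that single statement into a global load into a register followed by a shared-memory store (LDG/LD then STS in the SASS). The kernel below is only an illustrative sketch; the names and tile size are made up for the example.

    __global__ void stage_to_shared(const float *in, float *out, int n)
    {
        __shared__ float tile[256];   // illustrative tile, sized to the block

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            // One source statement, but on the hardware this is a global
            // load into a register followed by a store to shared memory --
            // the "register" hop in the path described above.
            tile[threadIdx.x] = in[idx];
        }
        __syncthreads();

        // ... work on tile[] cooperatively, then write results back out.
        if (idx < n)
            out[idx] = tile[threadIdx.x] * 2.0f;
    }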

  • The unified cache is enabled with the option -Xptxas -dlcm=ca.

Solution

  • This might answer most of your questions about memory types and data paths within the Maxwell architecture:

    As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.

    In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.

    Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.

    The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.

    From section "1.4.2. Memory Throughput", sub-section "1.4.2.1. Unified L1/Texture Cache", in NVIDIA's Maxwell Tuning Guide.
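
    As a sketch of the two caching paths described above (plain global loads versus the LDG read-only mechanism), consider the kernel below; the names are invented for the example, and the compile line shows the opt-in flag from the quote.

        // Sketch only. Compile for GM204 (GTX 980) with the opt-in flag:
        //   nvcc -arch=sm_52 -Xptxas -dlcm=ca example.cu
        __global__ void load_paths(const float * __restrict__ in, float *out, int n)
        {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx >= n) return;

            // Plain global load: cached in L2 only by default on Maxwell;
            // also cached in the unified L1/texture cache when built
            // with -Xptxas -dlcm=ca.
            float a = in[idx];

            // LDG read-only path: routed through the unified L1/texture cache.
            float b = __ldg(&in[idx]);

            out[idx] = a + b;
        }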

    The sections and sub-sections that follow these two also cover other useful details about shared memory size/bandwidth, caching, and so on. Give it a try!
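
    If you do re-tune the occupancy-versus-spilling balance the guide mentions, the usual knobs are __launch_bounds__ in the source or nvcc's -maxrregcount flag; here is a minimal, hypothetical sketch:

        // Hypothetical tuning sketch: capping threads per block lets the
        // compiler allocate more registers per thread, trading occupancy
        // for less register spilling.
        __global__ void __launch_bounds__(128) tuned_kernel(const float *in,
                                                            float *out, int n)
        {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx < n)
                out[idx] = in[idx] * in[idx];
        }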