I am running a memory coalescing experiment on Pascal and getting unexpected nvprof results. I have one kernel that copies 4 GB of floats from one array to another. nvprof reports confusing numbers for gld_transactions_per_request and gst_transactions_per_request.
I ran the experiment on a TITAN Xp and a GeForce GTX 1080 Ti, with the same results.
    #include <stdio.h>
    #include <stdlib.h>   // exit
    #include <cstdint>
    #include <assert.h>

    // Number of floats to copy: 2^30 elements = 4 GiB per array.
    // Parenthesized so the macro expands safely inside larger expressions.
    #define N (1ULL*1024*1024*1024)

    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
    {
        if (code != cudaSuccess)
        {
            fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if (abort) exit(code);
        }
    }

    // Grid-stride copy: consecutive threads read and write consecutive floats.
    __global__ void copy_kernel(
        const float* __restrict__ data, float* __restrict__ data2) {
        for (unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
             tid < N; tid += blockDim.x * gridDim.x) {
            data2[tid] = data[tid];
        }
    }

    int main() {
        float* d_data;
        gpuErrchk(cudaMalloc(&d_data, sizeof(float) * N));
        assert(d_data != nullptr);
        uintptr_t d = reinterpret_cast<uintptr_t>(d_data);
        assert(d%128 == 0); // check alignment, just to be sure

        float* d_data2;
        gpuErrchk(cudaMalloc(&d_data2, sizeof(float)*N));
        assert(d_data2 != nullptr);

        copy_kernel<<<1024,1024>>>(d_data, d_data2);
        gpuErrchk(cudaDeviceSynchronize());

        // Release the device buffers.
        gpuErrchk(cudaFree(d_data));
        gpuErrchk(cudaFree(d_data2));
    }
Compiled with CUDA version 10.1:
    nvcc coalescing.cu -std=c++11 -Xptxas -dlcm=ca -gencode arch=compute_61,code=sm_61 -O3
Profiled with:
    nvprof -m all ./a.out
There are a few confusing parts in the profiling results:
- gld_transactions = 536870914, which means that every global load transaction should on average be 4GB/536870914 = 8 bytes. This is consistent with gld_transactions_per_request = 16.000000: Each warp reads 128 bytes (1 request) and if every transaction is 8 bytes, then we need 128 / 8 = 16 transactions per request. Why is this value so low? I would expect perfect coalescing, so something along the lines of 4 (or even 1) transactions/request.
- gst_transactions = 134217728 and gst_transactions_per_request = 4.000000, so storing memory is more efficient?
- Requested and achieved global load/store throughput (gld_requested_throughput, gst_requested_throughput, gld_throughput, gst_throughput) is 150.32GB/s each. I would expect a lower throughput for loads than for stores since we have more transactions per request.
- gld_transactions = 536870914 but l2_read_transactions = 134218800. Global memory is always accessed through the L1/L2 caches. Why is the number of L2 read transactions so much lower? It can't all be cached in the L1. (global_hit_rate = 0%)
I think I am reading the nvprof results wrong. Any suggestions would be appreciated.
Here is the full profiling result:
Device "GeForce GTX 1080 Ti (0)"
Kernel: copy_kernel(float const *, float*)
Invocations Metric Name Metric Description Min Max Avg
1 inst_per_warp Instructions per warp 1.4346e+04 1.4346e+04 1.4346e+04
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 99.99% 99.99% 99.99%
1 inst_replay_overhead Instruction Replay Overhead 0.000178 0.000178 0.000178
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 16.000000 16.000000 16.000000
1 gst_transactions_per_request Global Store Transactions Per Request 4.000000 4.000000 4.000000
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
1 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 536870914 536870914 536870914
1 gst_transactions Global Store Transactions 134217728 134217728 134217728
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 134218800 134218800 134218800
1 l2_write_transactions L2 Write Transactions 134217741 134217741 134217741
1 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
1 gld_requested_throughput Requested Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_requested_throughput Requested Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gld_throughput Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_throughput Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
1 tex_cache_hit_rate Unified Cache Hit Rate 50.00% 50.00% 50.00%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
1 tex_cache_throughput Unified Cache Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_read_throughput L2 Throughput (Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_write_throughput L2 Throughput (Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 5.8711KB/s 5.8711KB/s 5.8701KB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 tex_cache_transactions Unified Cache Transactions 134217728 134217728 134217728
1 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 inst_executed Instructions Executed 470089728 470089728 470089728
1 inst_issued Instructions Issued 470173430 470173430 470173430
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.79% 0.79% 0.79%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 1.46% 1.46% 1.46%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 96.16% 96.16% 96.16%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 1.13% 1.13% 1.13%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.00% 0.00% 0.00%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.07% 0.07% 0.07%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 0 0 0
1 inst_fp_64 FP Instructions(Double) 0 0 0
1 inst_integer Integer Instructions 1.0742e+10 1.0742e+10 1.0742e+10
1 inst_bit_convert Bit-Convert Instructions 0 0 0
1 inst_control Control-Flow Instructions 1073741824 1073741824 1073741824
1 inst_compute_ld_st Load/Store Instructions 2147483648 2147483648 2147483648
1 inst_misc Misc Instructions 1077936128 1077936128 1077936128
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 470173430 470173430 470173430
1 cf_issued Issued Control-Flow Instructions 33619968 33619968 33619968
1 cf_executed Executed Control-Flow Instructions 33619968 33619968 33619968
1 ldst_issued Issued Load/Store Instructions 268500992 268500992 268500992
1 ldst_executed Executed Load/Store Instructions 67174400 67174400 67174400
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 134217728 134217728 134217728
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.00% 0.00% 0.00%
1 stall_not_selected Issue Stall Reasons (Not Selected) 0.39% 0.39% 0.39%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 134217728 134217728 134217728
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
1 pcie_total_data_transmitted PCIe Total Data Transmitted 1024 1024 1024
1 pcie_total_data_received PCIe Total Data Received 0 0 0
1 inst_executed_global_loads Warp level instructions for global loads 33554432 33554432 33554432
1 inst_executed_local_loads Warp level instructions for local loads 0 0 0
1 inst_executed_shared_loads Warp level instructions for shared loads 0 0 0
1 inst_executed_surface_loads Warp level instructions for surface loads 0 0 0
1 inst_executed_global_stores Warp level instructions for global stores 33554432 33554432 33554432
1 inst_executed_local_stores Warp level instructions for local stores 0 0 0
1 inst_executed_shared_stores Warp level instructions for shared stores 0 0 0
1 inst_executed_surface_stores Warp level instructions for surface stores 0 0 0
1 inst_executed_global_atomics Warp level instructions for global atom and atom cas 0 0 0
1 inst_executed_global_reductions Warp level instructions for global reductions 0 0 0
1 inst_executed_surface_atomics Warp level instructions for surface atom and atom cas 0 0 0
1 inst_executed_surface_reductions Warp level instructions for surface reductions 0 0 0
1 inst_executed_shared_atomics Warp level shared instructions for atom and atom CAS 0 0 0
1 inst_executed_tex_ops Warp level instructions for texture 0 0 0
1 l2_global_load_bytes Bytes read from L2 for misses in Unified Cache for global loads 4294967296 4294967296 4294967296
1 l2_local_load_bytes Bytes read from L2 for misses in Unified Cache for local loads 0 0 0
1 l2_surface_load_bytes Bytes read from L2 for misses in Unified Cache for surface loads 0 0 0
1 l2_local_global_store_bytes Bytes written to L2 from Unified Cache for local and global stores. 4294967296 4294967296 4294967296
1 l2_global_reduction_bytes Bytes written to L2 from Unified cache for global reductions 0 0 0
1 l2_global_atomic_store_bytes Bytes written to L2 from Unified cache for global atomics 0 0 0
1 l2_surface_store_bytes Bytes written to L2 from Unified Cache for surface stores. 0 0 0
1 l2_surface_reduction_bytes Bytes written to L2 from Unified Cache for surface reductions 0 0 0
1 l2_surface_atomic_store_bytes Bytes transferred between Unified Cache and L2 for surface atomics 0 0 0
1 global_load_requests Total number of global load requests from Multiprocessor 134217728 134217728 134217728
1 local_load_requests Total number of local load requests from Multiprocessor 0 0 0
1 surface_load_requests Total number of surface load requests from Multiprocessor 0 0 0
1 global_store_requests Total number of global store requests from Multiprocessor 134217728 134217728 134217728
1 local_store_requests Total number of local store requests from Multiprocessor 0 0 0
1 surface_store_requests Total number of surface store requests from Multiprocessor 0 0 0
1 global_atomic_requests Total number of global atomic requests from Multiprocessor 0 0 0
1 global_reduction_requests Total number of global reduction requests from Multiprocessor 0 0 0
1 surface_atomic_requests Total number of surface atomic requests from Multiprocessor 0 0 0
1 surface_reduction_requests Total number of surface reduction requests from Multiprocessor 0 0 0
1 sysmem_read_bytes System Memory Read Bytes 0 0 0
1 sysmem_write_bytes System Memory Write Bytes 160 160 160
1 l2_tex_hit_rate L2 Cache Hit Rate 0.00% 0.00% 0.00%
1 texture_load_requests Total number of texture Load requests from Multiprocessor 0 0 0
1 unique_warps_launched Number of warps launched 32768 32768 32768
1 sm_efficiency Multiprocessor Activity 99.63% 99.63% 99.63%
1 achieved_occupancy Achieved Occupancy 0.986477 0.986477 0.986477
1 ipc Executed IPC 0.344513 0.344513 0.344513
1 issued_ipc Issued IPC 0.344574 0.344574 0.344574
1 issue_slot_utilization Issue Slot Utilization 8.61% 8.61% 8.61%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.592326 0.592326 0.592326
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 tex_fu_utilization Texture Function Unit Utilization Low (1) Low (1) Low (1)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
1 dram_read_transactions Device Memory Read Transactions 134218560 134218560 134218560
1 dram_write_transactions Device Memory Write Transactions 134176900 134176900 134176900
1 dram_read_throughput Device Memory Read Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 dram_write_throughput Device Memory Write Throughput 150.27GB/s 150.27GB/s 150.27GB/s
1 dram_utilization Device Memory Utilization High (7) High (7) High (7)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 ecc_transactions ECC Transactions 0 0 0
1 ecc_throughput ECC Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 dram_read_bytes Total bytes read from DRAM to L2 cache 4294993920 4294993920 4294993920
1 dram_write_bytes Total bytes written from L2 cache to DRAM 4293660800 4293660800 4293660800
With Fermi and Kepler GPUs, when a global transaction was issued, it was always for 128 bytes, and the L1 cacheline size (if enabled) was 128 bytes. With Maxwell and Pascal, these characteristics changed. In particular, a read of a portion of an L1 cacheline does not necessarily trigger a full 128-byte width transaction. This is fairly easily discoverable/provable with microbenchmarking.
Effectively, the size of a global load transaction changed, subject to a certain quantum of granularity. Because of this change in transaction size, multiple transactions may now be required where previously only one was. As far as I know, none of this is clearly published or detailed, and I won't be able to do that here. However, I think we can address a number of your questions without giving a precise description of how global load transactions are calculated.
> gld_transactions = 536870914, which means that every global load transaction should on average be 4GB/536870914 = 8 bytes. This is consistent with gld_transactions_per_request = 16.000000: Each warp reads 128 bytes (1 request) and if every transaction is 8 bytes, then we need 128 / 8 = 16 transactions per request. Why is this value so low? I would expect perfect coalescing, so something along the lines of 4 (or even 1) transactions/request.
This mindset (1 transaction per request for fully coalesced loads of a 32-bit quantity per thread) would have been correct in the Fermi/Kepler timeframe. It is no longer correct for Maxwell and Pascal GPUs. As you've already calculated, the transaction size appears to be smaller than 128 bytes, and therefore the number of transactions per request is higher than 1. But this doesn't indicate an efficiency problem per se (as it would have in the Fermi/Kepler timeframe). So let's just acknowledge that the transaction size can be smaller, and therefore transactions per request can be higher, even though the underlying traffic is essentially 100% efficient.
> gst_transactions = 134217728 and gst_transactions_per_request = 4.000000, so storing memory is more efficient?
No, that's not what this means. It simply means that the subdivision quanta can be different for loads (load transactions) and stores (store transactions). These happen to be 32-byte transactions. In either case, loads or stores, the transactions are and should be fully efficient in this case. The requested traffic is consistent with the actual traffic, and other profiler metrics confirm this. If the actual traffic were much higher than the requested traffic, that would be a good indication of inefficient loads or stores:
1 gld_requested_throughput Requested Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_requested_throughput Requested Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gld_throughput Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_throughput Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
> Requested and achieved global load/store throughput (gld_requested_throughput, gst_requested_throughput, gld_throughput, gst_throughput) is 150.32GB/s each. I would expect a lower throughput for loads than for stores since we have more transactions per request.
Again, you'll have to adjust your way of thinking to account for variable transaction sizes. Throughput is driven by the needs and efficiency associated with fulfilling those needs. Both loads and stores are fully efficient for your code design, so there is no reason to think there is or should be an imbalance in efficiency.
> gld_transactions = 536870914 but l2_read_transactions = 134218800. Global memory is always accessed through the L1/L2 caches. Why is the number of L2 read transactions so much lower? It can't all be cached in the L1. (global_hit_rate = 0%)
This is simply due to the different size of the transactions. You've already calculated that the apparent global load transaction size is 8 bytes, and I've already indicated that the L2 transaction size is 32 bytes, so it makes sense that there would be a 4:1 ratio between the total number of transactions, since they reflect the same movement of the same data, viewed through 2 different lenses. Note that there has always been a disparity in the size of global transactions vs. the size of L2 transactions, or transactions to DRAM. It's simply that the ratios of these may vary by GPU architecture, and possibly other factors, such as load patterns.
Some notes:
- I won't be able to answer questions such as "why is it this way?", "why did Pascal change from Fermi/Kepler?", "given this particular code, what would you predict as the needed global load transactions on this particular GPU?", or "generally, for this particular GPU, how would I calculate or predict transaction size?"
- As an aside, NVIDIA is advancing new profiling tools (Nsight Compute and Nsight Systems) for GPU work. Many of the efficiency and transactions-per-request metrics available in nvprof are gone under the new toolchain. So these mindsets will have to be broken anyway, because these methods of ascertaining efficiency won't be available moving forward, based on the current metric set.
- Note that compile switches such as -Xptxas -dlcm=ca may affect (L1) caching behavior. I don't expect caches to have much performance or efficiency impact on this particular copy code, however.
- This possible reduction in transaction size is generally a good thing. It results in no loss of efficiency for traffic patterns such as the one presented in this code, and for certain other codes it allows requests smaller than 128 bytes to be satisfied with less wasted bandwidth.
- Although not specifically about Pascal, here is a better-defined example of the possible variability in these measurements for Maxwell. Pascal will have similar variability. Also, some small hint of this change (especially for Pascal) was given in the Pascal Tuning Guide. It by no means offers a complete description or explains all of your observations, but it does hint at the general idea that global transactions are no longer fixed to a 128-byte size.