I have a CUDA program that I am running on a GTX 680. While testing different compiler options I noticed that:
compiling my code for compute capability 1.0 and sm_10 gives a runtime of 47 ms
compiling my code for compute capability 3.5 (also 2.0) and sm_30 gives a runtime of 60 ms
What might be the reasons for these results?
I am compiling with Nsight Eclipse Edition on Linux with CUDA 5.0, and my kernel is mostly memory bound.
Thanks.
The commands:
CC 1.0:
nvcc --compile -O0 -Xptxas -v -gencode arch=compute_10,code=compute_10 -gencode arch=compute_10,code=sm_10 -keep -keep-dir /tmp/debug -lineinfo -pg -v -x cu -o "BenOlaCuda/src/main.o" "../BenOlaCuda/src/main.cu"
CC 3.0:
nvcc -lineinfo -pg -O0 -v -keep -keep-dir /tmp/debug -Xptxas -v -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -odir "BenOlaCuda/src" -M -o "BenOlaCuda/src/main.d" "../BenOlaCuda/src/main.cu"
Some more info on compiling my kernel:
CC 1.0:
ptxas info : Compiling entry function '_Z15optimizePixelZ3tfPfS_S_S_tttttt' for 'sm_10'
ptxas info : Used 40 registers, 68 bytes smem, 64 bytes cmem[1], 68 bytes lmem
CC 3.0:
ptxas info : Compiling entry function '_Z15optimizePixelZ3tfPfS_S_S_tttttt' for 'sm_30'
ptxas info : Function properties for _Z15optimizePixelZ3tfPfS_S_S_tttttt
128 bytes stack frame, 100 bytes spill stores, 108 bytes spill loads
ptxas info : Used 63 registers, 380 bytes cmem[0], 20 bytes cmem[2]
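The ptxas output above shows the sm_30 build using 63 registers plus 100 bytes of spill stores and 108 bytes of spill loads, while the sm_10 build fits in 40 registers with no stack frame. If those spills are hurting a memory-bound kernel, register pressure can be capped explicitly. A minimal sketch (the kernel name, body, and launch-bounds values below are illustrative placeholders, not taken from the question):

```cuda
// Option 1: per-kernel hint. __launch_bounds__(maxThreadsPerBlock,
// minBlocksPerMultiprocessor) tells ptxas to budget registers so that
// at least 4 blocks of up to 256 threads can be resident per SM.
// The 256/4 values here are hypothetical tuning knobs.
__global__ void __launch_bounds__(256, 4)
scaleKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // placeholder body
}

// Option 2: a global cap at compile time, applied to all kernels
// in the translation unit:
//   nvcc ... --maxrregcount=40 ...
```

Note that forcing the register count down can simply trade register pressure for more spills, so it is worth re-checking the `-Xptxas -v` output after each change.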
About two years ago I switched my simulation from CUDA 3.2 to CUDA 4.0 and experienced a performance hit of about 10%. With compute capability 2.0, NVIDIA introduced IEEE 754-2008-conformant floating-point arithmetic (CC 1.x used IEEE 754-1985). This, together with flush-to-zero no longer being the default, was the reason for the performance hit. Try compiling your CC 3.0 executable with the compiler flag --use_fast_math. This restores the lower precision that CC 1.0 had.
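If full fast math is too aggressive, the precision behaviors it bundles can also be toggled individually with nvcc's --ftz, --prec-div, and --prec-sqrt options. A sketch of both variants (the output name and source file are placeholders):

```
# Everything at once: flush-to-zero, approximate division/sqrt,
# and fast intrinsic substitutions (__sinf, __expf, ...)
nvcc -gencode arch=compute_30,code=sm_30 --use_fast_math -o app main.cu

# Finer-grained: only the IEEE 754 features that CC 1.x hardware lacked
nvcc -gencode arch=compute_30,code=sm_30 \
     -ftz=true -prec-div=false -prec-sqrt=false -o app main.cu
```

The second form lets you measure which of the individual relaxations actually recovers the lost time, rather than changing all of them at once.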