Tags: cuda, gpgpu, nvidia, compiler-optimization, nsight

CUDA compute capability 1.0 faster than 3.5


I have a CUDA program that I am running on a GTX 680. While testing different compiler options I noticed that:

  • compiling my code for compute capability 1.0 and sm_10 gives a runtime of 47 ms

  • compiling my code for compute capability 3.5 (and also 2.0) and sm_30 gives a runtime of 60 ms


What might be the reasons for these results?

I am compiling with Nsight on Linux with CUDA 5.0, and my kernel is mostly memory bound.

Thanks.


The commands:

cc 1.0

nvcc --compile -O0 -Xptxas -v -gencode arch=compute_10,code=compute_10 -gencode arch=compute_10,code=sm_10 -keep -keep-dir /tmp/debug -lineinfo -pg -v  -x cu -o  "BenOlaCuda/src/main.o" "../BenOlaCuda/src/main.cu"

cc 3.0

nvcc -lineinfo -pg -O0 -v -keep -keep-dir /tmp/debug -Xptxas -v -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -odir "BenOlaCuda/src" -M -o "BenOlaCuda/src/main.d" "../BenOlaCuda/src/main.cu"
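
Note that the two commands above differ in more than the -gencode options: the cc 3.0 line shown is the -M dependency-generation step rather than the actual compile, and both use -O0, -pg and -lineinfo. A minimal apples-to-apples sketch, where only the -gencode pair changes, might look like the following; the file name, kernel and timing harness are placeholders, not the code from the question:

// test.cu -- hypothetical, minimal memory-bound kernel used only to compare
// the two architecture targets with otherwise identical flags.
//
// Build for compute capability 1.0:
//   nvcc -O2 -Xptxas -v -gencode arch=compute_10,code=sm_10 -o test_sm10 test.cu
// Build for compute capability 3.0:
//   nvcc -O2 -Xptxas -v -gencode arch=compute_30,code=sm_30 -o test_sm30 test.cu

#include <cstdio>

__global__ void scale(float *out, const float *in, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;          // memory bound: one load, one store per thread
}

int main()
{
    const int n = 1 << 24;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms (%s)\n", ms, cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}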

Some more info on compiling my kernel:

cc 1.0

ptxas info    : Compiling entry function '_Z15optimizePixelZ3tfPfS_S_S_tttttt' for 'sm_10'
ptxas info    : Used 40 registers, 68 bytes smem, 64 bytes cmem[1], 68 bytes lmem

cc 3.0

ptxas info    : Compiling entry function '_Z15optimizePixelZ3tfPfS_S_S_tttttt' for 'sm_30'
ptxas info    : Function properties for _Z15optimizePixelZ3tfPfS_S_S_tttttt
128 bytes stack frame, 100 bytes spill stores, 108 bytes spill loads
ptxas info    : Used 63 registers, 380 bytes cmem[0], 20 bytes cmem[2]
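
The sm_30 build uses 63 registers and spills to local memory (128-byte stack frame, roughly 100 bytes of spill traffic), while the sm_10 build fits in 40 registers with 68 bytes of lmem. This is separate from the accepted answer below, but one common experiment given such output is to cap register usage and re-time the kernel, either with a __launch_bounds__ qualifier on the kernel or with nvcc's -maxrregcount option. The kernel name, body and limits in this sketch are placeholders:

// launch_bounds_sketch.cu -- hypothetical kernel, only to show the mechanism.
// Compile check: nvcc -c -O2 -Xptxas -v -gencode arch=compute_30,code=sm_30 launch_bounds_sketch.cu

__global__ void
__launch_bounds__(256, 4)   // hint: at most 256 threads per block, aim for 4 resident
                            // blocks per SM; ptxas limits register usage accordingly
myKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];     // stand-in for the real kernel body
}

// Alternatively, cap registers for the whole compilation unit from the command line:
//   nvcc ... -gencode arch=compute_30,code=sm_30 -maxrregcount=40 main.cu
// and compare the new "Used N registers" / spill lines reported by -Xptxas -v.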

Solution

  • About two years ago I switched my simulation from CUDA 3.2 to CUDA 4.0 and experienced a performance hit of about 10%. With compute capability 2.0, NVIDIA introduced IEEE 754-2008 conformant floating-point arithmetic (CC 1.x used IEEE 754-1985), and denormals are no longer flushed to zero by default. This was the reason for the performance hit. Try compiling your CC 3.0 executable with the compiler flag --use_fast_math; it restores behaviour close to the old CC 1.x precision.
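
To make that concrete, here is a small sketch (hypothetical file and kernel, not the code from the question) of what --use_fast_math changes on a CC 2.0+/3.x build: single-precision division and sqrtf switch from IEEE 754 correctly-rounded code to faster approximations, and denormals are flushed to zero, which is close to what CC 1.x hardware did by default.

// fastmath_sketch.cu -- hypothetical kernel, not the one from the question.
//
// Precise (default CC 2.0+) build:
//   nvcc -O2 -gencode arch=compute_30,code=sm_30 -c fastmath_sketch.cu
// Fast-math build, closer to the old CC 1.x behaviour:
//   nvcc -O2 -gencode arch=compute_30,code=sm_30 --use_fast_math -c fastmath_sketch.cu
// --use_fast_math implies, among other things,
//   -ftz=true --prec-div=false --prec-sqrt=false
// so denormals are flushed to zero and division/sqrtf use faster approximations.

#include <math.h>

__global__ void normalize(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // With --use_fast_math, the division and sqrtf below are compiled to
        // approximate instructions instead of IEEE 754 correctly-rounded code.
        out[i] = in[i] / sqrtf(in[i] * in[i] + 1.0f);
}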