How can I compile a CUDA application that targets both Kepler and Maxwell Architectures?

I do development on desktops, which have a Titan X card (Maxwell architecture). However, the production code runs on servers which have K40 cards (Kepler architecture).

How can I build my code so that it runs optimally on both systems?

So far I have been building with compute_20,sm_20, but I suspect that this setting is not optimal.

Solution

The first thing you would want to do is build a fat binary that contains machine code (SASS) for sm_35 (the architecture of the K40) and sm_52 (the architecture of the Titan X), plus intermediate code (PTX) for compute_52, for JIT compilation on future GPUs. You do so via the -gencode switch of nvcc:

nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52

This ensures that the generated machine code is tailored to, and makes full use of, each of the specified architectures. When the CUDA driver or runtime loads a kernel on a particular GPU, it automatically selects the version whose machine code matches that GPU. On a GPU newer than those listed, it JIT-compiles the embedded PTX instead.
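If the set of target GPUs is likely to change, the -gencode options can be generated instead of typed by hand. The following is only a minimal sketch, assuming a POSIX shell; the function name gencode_flags and its two arguments are made up for this illustration and are not part of nvcc.

```shell
#!/bin/sh
# Hypothetical helper (not part of the CUDA toolkit): print nvcc -gencode options.
#   $1: space-separated compute capabilities to embed SASS for, e.g. "35 52"
#   $2: compute capability to embed PTX for (usually the newest target), e.g. "52"
gencode_flags() {
    flags=""
    # One SASS target per listed architecture
    for cc in $1; do
        flags="$flags -gencode arch=compute_$cc,code=sm_$cc"
    done
    # One PTX target for JIT compilation on future GPUs
    flags="$flags -gencode arch=compute_$2,code=compute_$2"
    # Drop the leading space before printing
    printf '%s\n' "${flags# }"
}

# K40 is sm_35, Titan X (Maxwell) is sm_52; PTX is kept for compute_52.
gencode_flags "35 52" "52"
```

The printed options are the same ones shown in the command above, so a build script could call something like nvcc $(gencode_flags "35 52" "52") -o app app.cu, where app and app.cu are placeholder names. To check what actually ended up in the binary, cuobjdump --list-elf app and cuobjdump --list-ptx app should list the embedded SASS and PTX images.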

What building a fat binary does not do is adjust the parameters of your code, such as kernel launch configurations, to be optimal for each architecture. So if you need the best possible performance on either platform, profile the application on both and consider machine-specific source code adjustments based on the results of those profiling experiments.