Since it's possible to use Clang for CUDA compilation, I am interested in studying how Clang converts CUDA code (a .cu file) to intermediate representation (IR).
CUDA compilation with Clang requires certain CUDA libraries. So is the parsing of the keyword __shared__
in a CUDA program done by Clang or by the CUDA compiler? From my initial searches, I believe the conversion is done by CUDA and not Clang. Is this understanding correct?
When Clang compiles CUDA code, the NVIDIA NVCC compiler is not involved.
__shared__, or more accurately __attribute__((shared)), is an attribute Clang knows. If Clang encounters a variable marked with the shared attribute, it will do two things: it gives the variable internal linkage, and it places it in address space 3, the shared memory address space.
Compiling this little program with clang:
__global__ void foo(int* tmp)
{
    __shared__ int vec[32];
    vec[threadIdx.x] = tmp[threadIdx.x];
    tmp[threadIdx.y] = vec[threadIdx.y];
}

int main()
{
    int* tmp;
    cudaMalloc(&tmp, 32 * sizeof(int));   // allocate device memory rather than passing an uninitialized pointer
    foo<<<1, 1>>>(tmp);
    int result = 0;
    cudaMemcpy(&result, tmp, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(tmp);
    return result;
}
results in the following IR:
; ModuleID = 'sm.cu'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-unknown-unknown"
@vec = internal unnamed_addr addrspace(3) global [32 x i32] zeroinitializer, align 4
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
; Function Attrs: nounwind readnone
declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() #0
define ptx_kernel void @__pacxx_kernel0(i32 addrspace(1)* %tmp) {
%1 = tail call spir_func i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1
%2 = zext i32 %1 to i64
%3 = getelementptr i32, i32 addrspace(1)* %tmp, i64 %2
%4 = load i32, i32 addrspace(1)* %3, align 4
%5 = getelementptr [32 x i32], [32 x i32] addrspace(3)* @vec, i64 0, i64 %2
store i32 %4, i32 addrspace(3)* %5, align 4
%6 = tail call spir_func i32 @llvm.nvvm.read.ptx.sreg.tid.y() #1
%7 = zext i32 %6 to i64
%8 = getelementptr [32 x i32], [32 x i32] addrspace(3)* @vec, i64 0, i64 %7
%9 = load i32, i32 addrspace(3)* %8, align 4
%10 = getelementptr i32, i32 addrspace(1)* %tmp, i64 %7
store i32 %9, i32 addrspace(1)* %10, align 4
ret void
}
You can see that the variable vec
has internal linkage inside the module (it is a module-level static) and resides in address space 3, the shared memory address space.
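If you want to reproduce this output yourself, newer Clang versions can emit the device-side IR (and the PTX) directly; the GPU architecture used below is only an example and may need to be adapted to your setup:

clang++ --cuda-device-only --cuda-gpu-arch=sm_35 -S -emit-llvm sm.cu -o sm.ll   # device-side LLVM IR
clang++ --cuda-device-only --cuda-gpu-arch=sm_35 -S sm.cu -o sm.ptx             # device-side PTX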
Clang follows the NVVM IR specification published by NVIDIA. However, NVVM IR is specified for LLVM 3.4, and you may encounter problems if you use IR generated by newer LLVM/Clang versions with tools that expect it. LLVM's NVPTX backend, however, does not have this restriction and can generate PTX code without problems. Newer versions of Clang will build a fat binary just like NVCC does. In older versions of Clang you have to build your executable on your own and compile the device part of the program with the -fcuda-is-device
frontend flag.
The PTX code can then be used to program a GPU by loading it at runtime through the CUDA API.
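As an illustration of that last step, here is a minimal sketch (error handling omitted) that loads the PTX through the CUDA Driver API; the file name sm.ptx and the mangled kernel name _Z3fooPi (for foo(int*)) are assumptions based on the example above:

#include <cuda.h>
#include <cstdio>
#include <vector>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Read the PTX file emitted by clang (file name is an assumption).
    FILE* f = fopen("sm.ptx", "rb");
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector<char> ptx(size + 1, 0);   // PTX image must be NUL-terminated
    fread(ptx.data(), 1, size, f);
    fclose(f);

    CUmodule mod;
    cuModuleLoadData(&mod, ptx.data());

    // The kernel keeps its C++-mangled name unless it was declared extern "C".
    CUfunction kernel;
    cuModuleGetFunction(&kernel, mod, "_Z3fooPi");

    CUdeviceptr tmp;
    cuMemAlloc(&tmp, 32 * sizeof(int));
    void* args[] = { &tmp };
    cuLaunchKernel(kernel, 1, 1, 1,    // grid dimensions
                   32, 1, 1,           // block dimensions
                   0, nullptr,         // dynamic shared memory, stream
                   args, nullptr);
    cuCtxSynchronize();

    cuMemFree(tmp);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}

Note that the host program is linked against the driver library (-lcuda), not the runtime library.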
EDIT:
Since the question came up where the __shared__
attribute is defined: it comes from host_defines.h,
which is included from the CUDA Toolkit by Clang's CUDA headers. In host_defines.h
(from CUDA 7.5) you can see:
#define __shared__ \
        __location__(shared)
and __location__
(which is another macro definition) expands to __annotate__:
#define __annotate__(a) \
        __attribute__((a))
#define __location__(a) \
        __annotate__(a)
which expands to __attribute__,
as I wrote in the first part of the answer. So __shared__
is expanded to __attribute__((shared)).
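In other words, after preprocessing, a declaration like the one in the example above looks to Clang as follows:

__shared__ int vec[32];               // what you write
__attribute__((shared)) int vec[32];  // what clang sees after macro expansion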