Tags: cuda, clang, llvm, llvm-ir, gpu-shared-memory

Conversion from __attribute__((shared)) to addrspace(3) in the Clang compiler when compiling CUDA files


The Clang compiler includes the CUDA header host_defines.h, in which __shared__ is defined as __attribute__((shared)). When CUDA source files are compiled to LLVM intermediate representation (IR) with clang, __shared__ is converted to addrspace(3). These address spaces can be seen in the clang source file llvm/tools/clang/lib/Basic/Targets.cpp, around line 1601, as an array:

static const unsigned NVPTXAddrSpaceMap[] = {
    1, // opencl_global
    3, // opencl_local
    4, // opencl_constant
    // FIXME: generic has to be added to the target
    0, // opencl_generic
    1, // cuda_device
    4, // cuda_constant
    3, // cuda_shared
};
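
For illustration, a device-side declaration like the following (a minimal made-up kernel, not from my actual code) is the kind of thing that triggers the conversion when clang emits the device IR:

__global__ void copyKernel(int *out) {
    __shared__ int buf[32];   // __shared__ expands to __attribute__((shared)) here
    buf[threadIdx.x] = out[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

The array buf ends up as an addrspace(3) global in the emitted IR.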

So the specific question is: at which stage of compilation does __attribute__((shared)) get converted to addrspace(3)? Looking at the parsing and lexing parts of clang gave no indication of this. Can somebody please help?


Solution

  • The shared attribute is defined in clang's Attr.td file; it is called CUDAShared there and is represented as CUDASharedAttr internally. Lexing and parsing work the same way for all attributes defined in Attr.td, so the lexing and parsing stages won't give you the insight you are looking for.

    The first point where you will see valuable code for CUDASharedAttr is clang/lib/Sema/SemaDeclAttr.cpp. The Sema class builds the AST, and SemaDeclAttr.cpp contains the handling for each attribute. For CUDASharedAttr in particular, handleSimpleAttribute<CUDASharedAttr>(S, D, Attr); is called. This function simply attaches the attribute to the given declaration, as sketched below.
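
    In clang releases from around that time, handleSimpleAttribute looks roughly like this (treat it as a sketch; the exact signature varies between clang versions):

    template <typename AttrType>
    static void handleSimpleAttribute(Sema &S, Decl *D,
                                      const AttributeList &Attr) {
      // Allocate the attribute node in the AST context and attach it to the Decl.
      D->addAttr(::new (S.Context) AttrType(Attr.getRange(), S.Context,
                                            Attr.getAttributeSpellingListIndex()));
    }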

    Now that the attribute is attached to a Decl, you can query whether a declaration carries it using D.hasAttr<CUDASharedAttr>(). This happens, for example, in SemaDecl.cpp, where the limitations on CUDA shared memory declarations are enforced and the storage class of the shared memory variable is set to static, roughly as in the sketch below.
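
    A simplified sketch of that check (not the exact clang code; NewVD stands for the VarDecl currently being processed):

    // Sketch only: once the attribute is attached, Sema can query it and
    // give the __shared__ (and __constant__) variable static storage
    // duration on the device side.
    if (NewVD->hasAttr<CUDASharedAttr>() || NewVD->hasAttr<CUDAConstantAttr>())
      NewVD->setStorageClass(SC_Static);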

    You will again find CUDASharedAttr in clang/lib/CodeGen/CodeGenModule.cpp, where the actual LLVM IR is emitted. In CodeGenModule.cpp you have the following function:

    unsigned CodeGenModule::GetGlobalVarAddressSpace(const VarDecl *D,
                                                     unsigned AddrSpace) {
      if (LangOpts.CUDA && LangOpts.CUDAIsDevice) {
        if (D->hasAttr<CUDAConstantAttr>())
          AddrSpace = getContext().getTargetAddressSpace(LangAS::cuda_constant);
        else if (D->hasAttr<CUDASharedAttr>())
          AddrSpace = getContext().getTargetAddressSpace(LangAS::cuda_shared);
        else
          AddrSpace = getContext().getTargetAddressSpace(LangAS::cuda_device);
      }
    
      return AddrSpace;
    }
    

    The function queries the address space for a shared variable from the actual target, i.e. for the NVPTX targets the address space map you posted is used:

    static const unsigned NVPTXAddrSpaceMap[] = {
        1, // opencl_global
        3, // opencl_local
        4, // opencl_constant
        // FIXME: generic has to be added to the target
        0, // opencl_generic
        1, // cuda_device
        4, // cuda_constant
        3, // cuda_shared
    };
    

    LangAS::cuda_shared corresponds to address space 3.
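
    The lookup itself is unspectacular; conceptually the target just indexes its address space map with the language-level address space, roughly like this (a sketch, not the exact clang code):

    // Sketch: NVPTX's TargetInfo holds NVPTXAddrSpaceMap, and the lookup
    // returns the entry for the requested language address space, so
    // LangAS::cuda_shared resolves to 3.
    unsigned getTargetAddressSpace(unsigned LangAddrSpace) const {
      return AddrSpaceMap[LangAddrSpace];
    }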

    After all these steps you will get a global variable with address space 3 in your final IR module, like this:

  ; ModuleID = 'sm.cu'
      target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
      target triple = "nvptx64-unknown-unknown"
    
  @vec = internal unnamed_addr addrspace(3) global [32 x i32] zeroinitializer, align 4
    
      ; Function Attrs: nounwind readnone
      declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0
    
      ; Function Attrs: nounwind readnone
      declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() #0
    
      define ptx_kernel void @__pacxx_kernel0(i32 addrspace(1)* %tmp) {
        %1 = tail call spir_func i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1
        %2 = zext i32 %1 to i64
        %3 = getelementptr i32, i32 addrspace(1)* %tmp, i64 %2
        %4 = load i32, i32 addrspace(1)* %3, align 4
        %5 = getelementptr [32 x i32], [32 x i32] addrspace(3)* @vec, i64 0, i64 %2
        store i32 %4, i32 addrspace(3)* %5, align 4
        %6 = tail call spir_func i32 @llvm.nvvm.read.ptx.sreg.tid.y() #1
        %7 = zext i32 %6 to i64
        %8 = getelementptr [32 x i32], [32 x i32] addrspace(3)* @vec, i64 0, i64 %7
        %9 = load i32, i32 addrspace(3)* %8, align 4
        %10 = getelementptr i32, i32 addrspace(1)* %tmp, i64 %7
        store i32 %9, i32 addrspace(1)* %10, align 4
        ret void
      }
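
    Note how @vec is emitted as an addrspace(3) global, matching the cuda_shared entry of the NVPTX address space map, and how every load and store through it keeps the addrspace(3) pointer type; the NVPTX backend later lowers that address space to the .shared state space in the generated PTX.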