Search code examples
cudamemcpyinvalid-argument

Is it possible that CUDA's memcpy(dst, src, 0) writes to *dst?


I have some kernel code which calls memcpy(my_dst, my_src, my_num_bytes) - and sometimes I have my_num_bytes equal to 0. Strangely enough, some sporadic experimentation (with Titan X, CUDA 7.5, driver 358.16) suggests data does get written into the destination when I use such a call.

  1. Have you encountered this behavior in CUDA?
  2. Is this specified anywhere? The programming guide entry for memcpy() doesn't say.

Solution

  • This would appear to be a bug in the (current, i.e. CUDA 7.5) implementation of device-side memcpy().

    A kernel like this:

    __global__ void kernel(char* source, char* dst, int len, int sz)
    {
    
        int i = threadIdx.x * len;
    
        memcpy(source+i, dst+i, sz);
    }
    

    leads the toolchain to emit PTX like this:

            // .globl       _Z6kernelPcS_ii
    .visible .entry _Z6kernelPcS_ii(
            .param .u64 _Z6kernelPcS_ii_param_0,
            .param .u64 _Z6kernelPcS_ii_param_1,
            .param .u32 _Z6kernelPcS_ii_param_2,
            .param .u32 _Z6kernelPcS_ii_param_3
    )
    {
            .reg .pred      %p<2>;
            .reg .b16       %rs<2>;
            .reg .b32       %r<4>;
            .reg .b64       %rd<15>;
    
    
            ld.param.u64    %rd7, [_Z6kernelPcS_ii_param_0];
            ld.param.u64    %rd8, [_Z6kernelPcS_ii_param_1];
            ld.param.u32    %r1, [_Z6kernelPcS_ii_param_2];
            cvta.to.global.u64      %rd9, %rd8;
            cvta.to.global.u64      %rd10, %rd7;
            mov.u32         %r2, %tid.x;
            mul.lo.s32      %r3, %r2, %r1;
            cvt.s64.s32     %rd11, %r3;
            add.s64         %rd1, %rd10, %rd11;
            add.s64         %rd2, %rd9, %rd11;
            mov.u64         %rd14, 0;
            ld.param.s32    %rd3, [_Z6kernelPcS_ii_param_3];
    
    BB6_1:
            add.s64         %rd12, %rd2, %rd14;
            ld.global.u8    %rs1, [%rd12];
            add.s64         %rd13, %rd1, %rd14;
            st.global.u8    [%rd13], %rs1;
            add.s64         %rd14, %rd14, 1;
            setp.lt.u64     %p1, %rd14, %rd3;
            @%p1 bra        BB6_1;
    
            ret;
    }
    

    My reading is that this code will always copy at least one byte because the value of the length argument isn't tested until after the byte copy. Something like this:

    BB6_1:
            setp.ge.u64     %p1, %rd14, %rd3;
            @%p1 bra        Done;
            add.s64         %rd12, %rd2, %rd14;
            ld.global.u8    %rs1, [%rd12];
            add.s64         %rd13, %rd1, %rd14;
            st.global.u8    [%rd13], %rs1;
            add.s64         %rd14, %rd14, 1;
            bra             BB6_1;
    Done:
    

    would probably work as expected.