Passing complex number arrays into PyCUDA Kernel

I'm trying to pass a 2-D array of complex numbers into a PyCUDA kernel, and am getting unexpected results.

Here's my test code:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
  #include <pycuda-complex.hpp>
  #include <stdio.h>

  typedef pycuda::complex<float> cmplx;
  __global__ void myFunc(cmplx *A)
  {
    // A      : input, array shape (), of type complex64

    int ROWS = 3;
    int COLS = 2;

    printf("\\nKernel >>");
    for(int row = 0; row < ROWS; row++)
    {
        printf("\\n");
        for(int col = 0; col < COLS; col++)
        {
            printf("[row %d, col %d]: %f + %fi",row, col, A[row,col].real(), A[row,col].imag());
            printf("\\t");
        }
    }

    printf("\\n\\n");  
  }
  """)


A = np.zeros((3,2),dtype=complex)
A[0,0] = 1.23 + 3.5j
A[1,0] = 3.4 + 1.0j

A_gpu = gpuarray.to_gpu(A.astype(np.complex64))

print("Host >>")
print(A_gpu)

func = mod.get_function("myFunc")
func(A_gpu,
     block=(1,1,1), grid=(1, 1, 1)
    )

The results are as follows:

Host >>
[[1.23+3.5j 0.  +0.j ]
 [3.4 +1.j  0.  +0.j ]
 [0.  +0.j  0.  +0.j ]]

Kernel >>
[row 0, col 0]: 1.230000 + 3.500000i    [row 0, col 1]: 0.000000 + 0.000000i    
[row 1, col 0]: 1.230000 + 3.500000i    [row 1, col 1]: 0.000000 + 0.000000i    
[row 2, col 0]: 1.230000 + 3.500000i    [row 2, col 1]: 0.000000 + 0.000000i

Could anybody explain why the array in the kernel doesn't look like the one I pass to it?

Solution

The indexing in your kernel code is broken (see here as to why).

While A[row,col] is technically valid syntax in C++, it doesn't imply multidimensional array slicing as it does in Python. In fact, A[row,col] evaluates to A[row], so it should be obvious why the output of print statement doesn't match your expectations.

Numpy arrays are storage contiguously in memory, and you must use your own indexing scheme to access the array. By default, numpy uses row major ordering for multidimensional arrays. This:

mod = SourceModule("""
  #include <pycuda-complex.hpp>
  #include <stdio.h>

  typedef pycuda::complex<float> cmplx;
  __global__ void myFunc(cmplx *A)
  {
    // A      : input, array shape (), of type complex64

    int ROWS = 3;
    int COLS = 2;

    printf("\\nKernel >>");
    for(int row = 0; row < ROWS; row++)
    {
        printf("\\n");
        for(int col = 0; col < COLS; col++)
        {
            printf("[row %d, col %d]: %f + %fi",row, col, A[row*COLS+col].real(), A[row*COLS+col].imag());
            printf("\\t");
        }
    }

    printf("\\n\\n");  
  }
""")

will work as expected:

%run complex_print.py
Host >>
[[ 1.23000002+3.5j  0.00000000+0.j ]
 [ 3.40000010+1.j   0.00000000+0.j ]
 [ 0.00000000+0.j   0.00000000+0.j ]]

Kernel >>
[row 0, col 0]: 1.230000 + 3.500000i    [row 0, col 1]: 0.000000 + 0.000000i    
[row 1, col 0]: 3.400000 + 1.000000i    [row 1, col 1]: 0.000000 + 0.000000i    
[row 2, col 0]: 0.000000 + 0.000000i    [row 2, col 1]: 0.000000 + 0.000000i