
Same code, different behavior with C#, AleaGPU and device memory


I'm using the AleaGPU library to perform matrix multiplications and similar operations, and I can't seem to understand why my code isn't working as expected.

By "not working as expected" I mean that the resulting matrix has the right values in the first row (or the first few rows), while all the remaining rows are filled with 0s, even though I use the same code in the other samples below.

Function #1 (doesn't work): this one shows the behavior described above. It looks like I've mixed up an index somewhere, but I don't see any difference between the code of the three examples below, and I'm not getting any errors either (AleaGPU usually throws an exception when trying to access an invalid array position).

public static double[,] Multiply([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentOutOfRangeException("Invalid matrices sizes");

    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);

    // Execute the multiplication in parallel
    using (DeviceMemory2D<double> m1_device = Gpu.Default.AllocateDevice(m1))
    using (DeviceMemory2D<double> m2_device = Gpu.Default.AllocateDevice(m2))
    using (DeviceMemory2D<double> mresult_device = Gpu.Default.AllocateDevice<double>(h, w))
    {
        // Pointers setup
        deviceptr<double>
            pm1 = m1_device.Ptr,
            pm2 = m2_device.Ptr,
            pmresult = mresult_device.Ptr;

        // Local wrapper function
        void Kernel(int ki)
        {
            // Calculate the current indexes
            int
                i = ki / w,
                j = ki % w;

            // Perform the multiplication
            double sum = 0;
            int im1 = i * l;
            for (int k = 0; k < l; k++)
            {
                // m1[i, k] * m2[k, j]
                sum += pm1[im1 + k] * pm2[k * w + j];
            }
            pmresult[i * w + j] = sum; // result[i, j]
        }

        // Iterate over each element of the result matrix
        Gpu.Default.For(0, h * w, Kernel);

        // Return the result
        return Gpu.Copy2DToHost(mresult_device);
    }
}

I looked at this code for hours trying to check every line but I really don't see what's wrong with it.


THIS WORKS FINE, but I don't see the difference from the first one

public static double[,] MultiplyGpuManaged([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentOutOfRangeException("Invalid matrices sizes");

    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    double[,]
        m1_gpu = Gpu.Default.Allocate(m1),
        m2_gpu = Gpu.Default.Allocate(m2),
        mresult_gpu = Gpu.Default.Allocate<double>(h, w);

    // Execute the multiplication in parallel
    Gpu.Default.For(0, h * w, index =>
    {
        // Calculate the current indexes
        int
            i = index / w,
            j = index % w;

        // Perform the multiplication
        double sum = 0;
        for (int k = 0; k < l; k++)
        {
            sum += m1_gpu[i, k] * m2_gpu[k, j];
        }
        mresult_gpu[i, j] = sum;
    });

    // Free memory and copy the result back
    Gpu.Free(m1_gpu);
    Gpu.Free(m2_gpu);
    double[,] result = Gpu.CopyToHost(mresult_gpu);
    Gpu.Free(mresult_gpu);
    return result;
}

THIS WORKS FINE TOO: I wrote this additional test to check whether I had messed up the indexes in the first function (apparently they're fine)

public static double[,] MultiplyOnCPU([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentOutOfRangeException("Invalid matrices sizes");

    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    double[,] result = new double[h, w];
    Parallel.For(0, h * w, index =>
    {
        unsafe
        {
            fixed (double* presult = result, pm1 = m1, pm2 = m2)
            {
                // Calculate the current indexes
                int
                    i = index / w,
                    j = index % w;

                // Perform the multiplication
                double sum = 0;
                int im1 = i * l;
                for (int k = 0; k < l; k++)
                {
                    sum += pm1[im1 + k] * pm2[k * w + j];
                }
                presult[i * w + j] = sum;
            }
        }
    });
    return result;
}

I really don't understand what I'm missing in the first method, or why it doesn't work.

Thank you in advance for your help!


Solution

  • Turns out the issue was caused by the way the GPU allocates 2D arrays: instead of using a single chunk of contiguous memory like standard .NET arrays, it adds some padding at the end of each row for performance reasons.

    The right way to address a 2D GPU array is to use its pitch, which indicates the effective width of each row (columns + padding).

    Here's a working code sample that just populates a 2D GPU array and copies it back to the host:

    const int size = 10;
    double[,] matrix_gpu;
    using (DeviceMemory2D<double> m_gpu = Gpu.Default.AllocateDevice<double>(size, size))
    {
        deviceptr<double> ptr = m_gpu.Ptr;
        int pitch = m_gpu.PitchInElements.ToInt32();
        Gpu.Default.For(0, size, i =>
        {
            for (int j = 0; j < size; j++)
            {
                ptr[i * pitch + j] = i * size + j;
            }
        });
        matrix_gpu = Gpu.Copy2DToHost(m_gpu);
    }
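
    For completeness, here's a sketch of how the original Multiply method could be rewritten with the same fix. It's untested and the name MultiplyWithPitch is just for illustration, but the AleaGPU calls are the same ones used in the question; the only change is that each allocation's PitchInElements is read and used when indexing, instead of assuming contiguous rows:

    public static double[,] MultiplyWithPitch([NotNull] this double[,] m1, [NotNull] double[,] m2)
    {
        // Checks
        if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentOutOfRangeException("Invalid matrices sizes");

        // Initialize the parameters and the result matrix
        int h = m1.GetLength(0);
        int w = m2.GetLength(1);
        int l = m1.GetLength(1);

        using (DeviceMemory2D<double> m1_device = Gpu.Default.AllocateDevice(m1))
        using (DeviceMemory2D<double> m2_device = Gpu.Default.AllocateDevice(m2))
        using (DeviceMemory2D<double> mresult_device = Gpu.Default.AllocateDevice<double>(h, w))
        {
            // Pointers setup
            deviceptr<double>
                pm1 = m1_device.Ptr,
                pm2 = m2_device.Ptr,
                pmresult = mresult_device.Ptr;

            // Each allocation can have a different pitch, so read all three
            int
                pitch1 = m1_device.PitchInElements.ToInt32(),
                pitch2 = m2_device.PitchInElements.ToInt32(),
                pitchr = mresult_device.PitchInElements.ToInt32();

            // Local wrapper function
            void Kernel(int ki)
            {
                int
                    i = ki / w,
                    j = ki % w;

                double sum = 0;
                for (int k = 0; k < l; k++)
                {
                    // m1[i, k] * m2[k, j], each row addressed through its own pitch
                    sum += pm1[i * pitch1 + k] * pm2[k * pitch2 + j];
                }
                pmresult[i * pitchr + j] = sum; // result[i, j]
            }

            Gpu.Default.For(0, h * w, Kernel);
            return Gpu.Copy2DToHost(mresult_device);
        }
    }

    Note that when the row width happens to be a multiple of the alignment the pitch equals the width, which is why small or conveniently sized matrices can appear to work even with the original (contiguous) indexing.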