I'm using the AleaGPU library to perform matrix multiplications and similar operations, and I can't seem to understand why my code isn't working as expected.
By "not working as expecting" I mean that the resulting matrix has the first row (or the first few rows) with the right values, and the rest of the rows are all filled with 0s, with the same code I've used in the other code samples below.
Function #1 (doesn't work): this one doesn't work for some reason, and it has the behavior described above. It sounds like I've confused an index, but I don't see any difference in the code of the three examples below, and I'm not getting any kind of errors (AleaGPU usually throws an exception when trying to access an invalid array position).
public static double[,] Multiply([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentException("Invalid matrix sizes");
    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    // Execute the multiplication in parallel
    using (DeviceMemory2D<double> m1_device = Gpu.Default.AllocateDevice(m1))
    using (DeviceMemory2D<double> m2_device = Gpu.Default.AllocateDevice(m2))
    using (DeviceMemory2D<double> mresult_device = Gpu.Default.AllocateDevice<double>(h, w))
    {
        // Pointers setup
        deviceptr<double>
            pm1 = m1_device.Ptr,
            pm2 = m2_device.Ptr,
            pmresult = mresult_device.Ptr;
        // Local wrapper function
        void Kernel(int ki)
        {
            // Calculate the current indexes
            int
                i = ki / w,
                j = ki % w;
            // Perform the multiplication
            double sum = 0;
            int im1 = i * l;
            for (int k = 0; k < l; k++)
            {
                // m1[i, k] * m2[k, j]
                sum += pm1[im1 + k] * pm2[k * w + j];
            }
            pmresult[i * w + j] = sum; // result[i, j]
        }
        // Iterate over each cell of the result matrix
        Gpu.Default.For(0, h * w, Kernel);
        // Return the result
        return Gpu.Copy2DToHost(mresult_device);
    }
}
I've looked at this code for hours, checking every single line, but I really can't see what's wrong with it.
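Here's the kind of minimal call that reproduces it (the matrix values are just an example I made up for this post):

double[,] a = { { 1, 2 }, { 3, 4 }, { 5, 6 } }; // 3x2
double[,] b = { { 1, 0, 1 }, { 0, 1, 1 } };     // 2x3
double[,] c = a.Multiply(b);                    // should be { { 1, 2, 3 }, { 3, 4, 7 }, { 5, 6, 11 } }
for (int i = 0; i < c.GetLength(0); i++)
{
    for (int j = 0; j < c.GetLength(1); j++) Console.Write($"{c[i, j]} ");
    Console.WriteLine();
}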
Function #2 (works fine): I don't see how this one differs from the first, but it returns the correct results.
public static double[,] MultiplyGpuManaged([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentException("Invalid matrix sizes");
    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    double[,]
        m1_gpu = Gpu.Default.Allocate(m1),
        m2_gpu = Gpu.Default.Allocate(m2),
        mresult_gpu = Gpu.Default.Allocate<double>(h, w);
    // Execute the multiplication in parallel
    Gpu.Default.For(0, h * w, index =>
    {
        // Calculate the current indexes
        int
            i = index / w,
            j = index % w;
        // Perform the multiplication
        double sum = 0;
        for (int k = 0; k < l; k++)
        {
            sum += m1_gpu[i, k] * m2_gpu[k, j];
        }
        mresult_gpu[i, j] = sum;
    });
    // Free memory and copy the result back
    Gpu.Free(m1_gpu);
    Gpu.Free(m2_gpu);
    double[,] result = Gpu.CopyToHost(mresult_gpu);
    Gpu.Free(mresult_gpu);
    return result;
}
Function #3 (works fine too): I wrote this additional CPU test to check whether I had messed up the indexes in the first function (apparently they're fine).
public static double[,] MultiplyOnCPU([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentException("Invalid matrix sizes");
    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    double[,] result = new double[h, w];
    Parallel.For(0, h * w, index =>
    {
        unsafe
        {
            fixed (double* presult = result, pm1 = m1, pm2 = m2)
            {
                // Calculate the current indexes
                int
                    i = index / w,
                    j = index % w;
                // Perform the multiplication
                double sum = 0;
                int im1 = i * l;
                for (int k = 0; k < l; k++)
                {
                    sum += pm1[im1 + k] * pm2[k * w + j];
                }
                presult[i * w + j] = sum;
            }
        }
    });
    return result;
}
I really don't understand what I'm missing in the first method, and I don't get why it doesn't work.
Thank you in advance for your help!
EDIT: it turns out the issue was caused by the way the GPU allocates 2D arrays: instead of using a single chunk of contiguous memory like standard .NET arrays, it adds some padding at the end of each row for performance reasons.
The right way to address a 2D GPU array is to use its pitch, which indicates the effective width of each row (columns + padding).
Here's a working code sample that just populates a 2D GPU array and copies it back to the host:
const int size = 10;
double[,] matrix_gpu;
using (DeviceMemory2D<double> m_gpu = Gpu.Default.AllocateDevice<double>(size, size))
{
    deviceptr<double> ptr = m_gpu.Ptr;
    int pitch = m_gpu.PitchInElements.ToInt32(); // effective row width: columns + padding
    Gpu.Default.For(0, size, i =>
    {
        for (int j = 0; j < size; j++)
        {
            // Address each row through the pitch, not the logical width
            ptr[i * pitch + j] = i * size + j;
        }
    });
    matrix_gpu = Gpu.Copy2DToHost(m_gpu);
}
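Based on that, here's how the first Multiply method can be fixed: each of the three device matrices has its own pitch, and every linear index has to go through it (this is a sketch using the same AleaGPU calls shown above; the only changes are the three pitch reads and the pitched indexing):

public static double[,] Multiply([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentException("Invalid matrix sizes");
    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    using (DeviceMemory2D<double> m1_device = Gpu.Default.AllocateDevice(m1))
    using (DeviceMemory2D<double> m2_device = Gpu.Default.AllocateDevice(m2))
    using (DeviceMemory2D<double> mresult_device = Gpu.Default.AllocateDevice<double>(h, w))
    {
        // Pointers and pitches: each allocation has its own row stride
        deviceptr<double>
            pm1 = m1_device.Ptr,
            pm2 = m2_device.Ptr,
            pmresult = mresult_device.Ptr;
        int
            pitch1 = m1_device.PitchInElements.ToInt32(),
            pitch2 = m2_device.PitchInElements.ToInt32(),
            pitchResult = mresult_device.PitchInElements.ToInt32();
        // Local wrapper function
        void Kernel(int ki)
        {
            // Calculate the current indexes
            int
                i = ki / w,
                j = ki % w;
            // Perform the multiplication, addressing rows through each pitch
            double sum = 0;
            int im1 = i * pitch1;
            for (int k = 0; k < l; k++)
            {
                // m1[i, k] * m2[k, j]
                sum += pm1[im1 + k] * pm2[k * pitch2 + j];
            }
            pmresult[i * pitchResult + j] = sum; // result[i, j]
        }
        // Iterate over each cell of the result matrix
        Gpu.Default.For(0, h * w, Kernel);
        // Return the result
        return Gpu.Copy2DToHost(mresult_device);
    }
}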