I need to parallelize this code. As you can see, the q array accumulates values on every iteration of the loop, which is what makes a naive parallelization unsafe.
/// <summary>
/// q = A * d, with the symmetric matrix A stored in CSR format
/// (the diagonal entry first in each row, then the upper triangle).
/// </summary>
public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
{
    double[] q = new double[order];
    for (int i = 0; i < order; i++)
    {
        int indexFirst = I[i];
        // First stored entry of the row (the diagonal), counted once.
        q[i] += A[indexFirst] * d[J[indexFirst]];
        for (int j = indexFirst + 1; j < I[i + 1]; j++)
        {
            int col = J[j];
            double a = A[j];
            q[i] += a * d[col];
            // Mirrored contribution of the entry at (col, i), by symmetry.
            q[col] += a * d[i];
        }
    }
    return q;
}
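For context, here is a minimal sketch of the storage convention the code implies: only the diagonal and upper triangle are stored, and the inner loop's `q[col] += a * d[i]` adds the mirrored lower-triangle contribution. The 3x3 matrix and its A/J/I arrays below are hypothetical, made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class CsrDemo
{
    // Hypothetical 3x3 symmetric matrix (only diagonal + upper triangle stored):
    //   [ 4  1  0 ]
    //   [ 1  5  2 ]
    //   [ 0  2  6 ]
    public static double[] Multiply()
    {
        var A = new List<double> { 4, 1, 5, 2, 6 }; // stored values, row by row, diagonal first
        var J = new List<int>    { 0, 1, 1, 2, 2 }; // column index of each stored value
        int[] I = { 0, 2, 4, 5 };                   // row i occupies slots I[i] .. I[i+1]-1
        double[] d = { 1, 2, 3 };

        int order = 3;
        double[] q = new double[order];
        for (int i = 0; i < order; i++)
        {
            int indexFirst = I[i];
            q[i] += A[indexFirst] * d[J[indexFirst]]; // diagonal, counted once
            for (int j = indexFirst + 1; j < I[i + 1]; j++)
            {
                int col = J[j];
                double a = A[j];
                q[i] += a * d[col];   // upper-triangle entry (i, col)
                q[col] += a * d[i];   // its mirror (col, i), by symmetry
            }
        }
        return q;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Multiply())); // 6, 17, 22
    }
}
```

Note that each off-diagonal entry writes to two different rows of q, which is exactly why rows cannot simply be split between threads without per-thread result arrays.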
I was able to parallelize the code as shown below: I split the rows into one section per CPU core, give each section its own q array, and sum the resulting arrays at the end. It works, but it seems overcomplicated. Any better idea?
Parallel.For(0, cpuCount, delegate(int i)
{
    MultiplySection(startIndices[i], endIndices[i], d, A, J, I, qSection[i]);
});

double[] q = new double[order];
for (int j = 0; j < cpuCount; j++)
{
    for (int i = startIndices[j]; i < order; i++)
    {
        q[i] += qSection[j][i];
    }
}
private static void MultiplySection(int startIndex, int endIndex, double[] d, List<double> A, List<int> J, int[] I, double[] q)
{
    for (int i = startIndex; i < endIndex; i++)
    {
        int indexFirst = I[i];
        q[i] += A[indexFirst] * d[J[indexFirst]];
        for (int j = indexFirst + 1; j < I[i + 1]; j++)
        {
            int col = J[j];
            double a = A[j];
            q[i] += a * d[col];
            q[col] += a * d[i];
        }
    }
}
Your current implementation is coarse-grained: if a CPU core is busy with something else, you risk poor utilization towards the end of the computation, because a finished core has no way to take over part of a straggler's section.

I would at least try the built-in partitioning of Parallel.For. This gives much finer-grained concurrency, which should help with load balancing. You can also use the localInit/localFinally overload to create and merge the per-task arrays:
public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
{
    double[] qTotal = new double[order];
    Parallel.For(0,
        order,
        () => new double[order], // localInit
        (i, loopState, q) =>     // main body
        {
            int indexFirst = I[i];
            q[i] += A[indexFirst] * d[J[indexFirst]];
            for (int j = indexFirst + 1; j < I[i + 1]; j++)
            {
                int col = J[j];
                double a = A[j];
                q[i] += a * d[col];
                q[col] += a * d[i];
            }
            return q;
        },
        q => // localFinally
        {
            lock (qTotal)
            {
                for (int i = 0; i < q.Length; i++)
                {
                    qTotal[i] += q[i];
                }
            }
        });
    return qTotal;
}
If the amount of work inside the loop body is small, you might still benefit from batching, but use a fixed batch size rather than basing it on the number of processors. This helps amortize the per-iteration delegate and scheduling overheads. You will likely need to experiment a bit to find your optimal batch size.
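One way to sketch such batching is with a range partitioner and the corresponding Parallel.ForEach overload, so each task processes whole row ranges instead of single rows. This is only an illustration: the batch size of 256 is a guess to be tuned, and the rest mirrors the localInit/localFinally pattern above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchedMultiply
{
    public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
    {
        const int batchSize = 256; // hypothetical; experiment to find the optimum
        double[] qTotal = new double[order];

        Parallel.ForEach(
            Partitioner.Create(0, order, batchSize), // fixed-size row ranges
            () => new double[order],                 // localInit: one scratch array per task
            (range, loopState, q) =>                 // body: process a whole row range
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    int indexFirst = I[i];
                    q[i] += A[indexFirst] * d[J[indexFirst]];
                    for (int j = indexFirst + 1; j < I[i + 1]; j++)
                    {
                        int col = J[j];
                        double a = A[j];
                        q[i] += a * d[col];
                        q[col] += a * d[i];
                    }
                }
                return q;
            },
            q =>                                     // localFinally: merge under a lock
            {
                lock (qTotal)
                {
                    for (int i = 0; i < order; i++)
                        qTotal[i] += q[i];
                }
            });

        return qTotal;
    }

    static void Main()
    {
        // Tiny made-up symmetric matrix, just to show the call shape.
        var A = new List<double> { 4, 1, 5, 2, 6 };
        var J = new List<int> { 0, 1, 1, 2, 2 };
        int[] I = { 0, 2, 4, 5 };
        double[] q = Multiplication(3, new double[] { 1, 2, 3 }, A, J, I);
        Console.WriteLine(string.Join(", ", q));
    }
}
```

Each per-task scratch array is still of length `order` (the mirrored writes can land anywhere to the right of the current row), so the merge cost is unchanged; batching only reduces the number of delegate invocations.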