I need to parallelize this code. As you can see, the q array accumulates values on every iteration of the loop, which is what makes a naive parallelization unsafe.
/// <summary>
/// q = A * d, with the symmetric matrix A stored in CSR format
/// (the diagonal entry first in each row, then the upper triangle).
/// </summary>
public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
{
    double[] q = new double[order];
    for (int i = 0; i < order; i++)
    {
        int indexFirst = I[i];
        // First stored entry of the row (the diagonal), counted once.
        q[i] += A[indexFirst] * d[J[indexFirst]];
        for (int j = indexFirst + 1; j < I[i + 1]; j++)
        {
            int col = J[j];
            double a = A[j];
            q[i] += a * d[col];
            // Mirrored contribution of the entry at (col, i), by symmetry.
            q[col] += a * d[i];
        }
    }
    return q;
}
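For context, here is a minimal sketch of the storage convention the code implies: only the diagonal and upper triangle are stored, and the inner loop's `q[col] += a * d[i]` adds the mirrored lower-triangle contribution. The 3x3 matrix and its A/J/I arrays below are hypothetical, made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class CsrDemo
{
    // Hypothetical 3x3 symmetric matrix (only diagonal + upper triangle stored):
    //   [ 4  1  0 ]
    //   [ 1  5  2 ]
    //   [ 0  2  6 ]
    public static double[] Multiply()
    {
        var A = new List<double> { 4, 1, 5, 2, 6 }; // stored values, row by row, diagonal first
        var J = new List<int>    { 0, 1, 1, 2, 2 }; // column index of each stored value
        int[] I = { 0, 2, 4, 5 };                   // row i occupies slots I[i] .. I[i+1]-1
        double[] d = { 1, 2, 3 };

        int order = 3;
        double[] q = new double[order];
        for (int i = 0; i < order; i++)
        {
            int indexFirst = I[i];
            q[i] += A[indexFirst] * d[J[indexFirst]]; // diagonal, counted once
            for (int j = indexFirst + 1; j < I[i + 1]; j++)
            {
                int col = J[j];
                double a = A[j];
                q[i] += a * d[col];   // upper-triangle entry (i, col)
                q[col] += a * d[i];   // its mirror (col, i), by symmetry
            }
        }
        return q;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Multiply())); // 6, 17, 22
    }
}
```

Note that each off-diagonal entry writes to two different rows of q, which is exactly why rows cannot simply be split between threads without per-thread result arrays.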
I was able to parallelize the code as shown below: I split the rows into one section per CPU core, give each section its own q array, and sum the resulting arrays at the end. It works, but it seems overcomplicated. Any better idea?
Parallel.For(0, cpuCount, delegate(int i)
{
    MultiplySection(startIndices[i], endIndices[i], d, A, J, I, qSection[i]);
});

double[] q = new double[order];
for (int j = 0; j < cpuCount; j++)
{
    for (int i = startIndices[j]; i < order; i++)
    {
        q[i] += qSection[j][i];
    }
}
private static void MultiplySection(int startIndex, int endIndex, double[] d, List<double> A, List<int> J, int[] I, double[] q)
{
    for (int i = startIndex; i < endIndex; i++)
    {
        int indexFirst = I[i];
        q[i] += A[indexFirst] * d[J[indexFirst]];
        for (int j = indexFirst + 1; j < I[i + 1]; j++)
        {
            int col = J[j];
            double a = A[j];
            q[i] += a * d[col];
            q[col] += a * d[i];
        }
    }
}
Your current implementation is coarse-grained: if a CPU core is busy with something else, you risk poor utilization towards the end of the computation, because a finished core has no way to take over part of a straggler's section.

I would at least try the built-in partitioning of Parallel.For. This gives much finer-grained concurrency, which should help with load balancing. You can also use the localInit/localFinally overload to create and merge the per-task arrays:
public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
{
    double[] qTotal = new double[order];
    Parallel.For(0,
        order,
        () => new double[order], // localInit
        (i, loopState, q) =>     // main body
        {
            int indexFirst = I[i];
            q[i] += A[indexFirst] * d[J[indexFirst]];
            for (int j = indexFirst + 1; j < I[i + 1]; j++)
            {
                int col = J[j];
                double a = A[j];
                q[i] += a * d[col];
                q[col] += a * d[i];
            }
            return q;
        },
        q => // localFinally
        {
            lock (qTotal)
            {
                for (int i = 0; i < q.Length; i++)
                {
                    qTotal[i] += q[i];
                }
            }
        });
    return qTotal;
}
If the amount of work inside the loop body is small, you might still benefit from batching, but use a fixed batch size rather than basing it on the number of processors. This helps amortize the per-iteration delegate and scheduling overheads. You will likely need to experiment a bit to find your optimal batch size.
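One way to sketch such batching is with a range partitioner and the corresponding Parallel.ForEach overload, so each task processes whole row ranges instead of single rows. This is only an illustration: the batch size of 256 is a guess to be tuned, and the rest mirrors the localInit/localFinally pattern above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchedMultiply
{
    public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
    {
        const int batchSize = 256; // hypothetical; experiment to find the optimum
        double[] qTotal = new double[order];

        Parallel.ForEach(
            Partitioner.Create(0, order, batchSize), // fixed-size row ranges
            () => new double[order],                 // localInit: one scratch array per task
            (range, loopState, q) =>                 // body: process a whole row range
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    int indexFirst = I[i];
                    q[i] += A[indexFirst] * d[J[indexFirst]];
                    for (int j = indexFirst + 1; j < I[i + 1]; j++)
                    {
                        int col = J[j];
                        double a = A[j];
                        q[i] += a * d[col];
                        q[col] += a * d[i];
                    }
                }
                return q;
            },
            q =>                                     // localFinally: merge under a lock
            {
                lock (qTotal)
                {
                    for (int i = 0; i < order; i++)
                        qTotal[i] += q[i];
                }
            });

        return qTotal;
    }

    static void Main()
    {
        // Tiny made-up symmetric matrix, just to show the call shape.
        var A = new List<double> { 4, 1, 5, 2, 6 };
        var J = new List<int> { 0, 1, 1, 2, 2 };
        int[] I = { 0, 2, 4, 5 };
        double[] q = Multiplication(3, new double[] { 1, 2, 3 }, A, J, I);
        Console.WriteLine(string.Join(", ", q));
    }
}
```

Each per-task scratch array is still of length `order` (the mirrored writes can land anywhere to the right of the current row), so the merge cost is unchanged; batching only reduces the number of delegate invocations.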