Search code examples
c#.netparallel-processingtask-parallel-libraryparallel.foreach

How do I parallelize this matrix by vector multiplication?


I need to parallelize this code. As you can see, the q array contains values that are summed at every loop.

/// <summary>
/// q = A * d, with A stored in CSR format.
/// </summary>
public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
{ 
    double[] q = new double[order];

    for (int i = 0; i < order; i++)
    {                    
       int indexFirst = I[i];

       q[i] += A[indexFirst] * d[J[indexFirst]];

       for (int j = indexFirst + 1; j < I[i + 1]; j++)
       {
          int col = J[j];
          double a = A[j];

          q[i] += a * d[col];

          q[col] += a * d[i];
       }
   }

   return q;
}

I was able to parallelize the code in this way (splitting the problem in a CPU count number of sections and summing up resulting q arrays in the end) but it seems overcomplicated. Any better idea?

Parallel.For(0, cpuCount, delegate(int i)
{
    MultiplySection(startIndices[i], endIndices[i], d, A, J, I, qSection[i]);
});

double[] q = new double[order];

for (int j = 0; j < cpuCount; j++)
{
    for (int i = startIndices[j]; i < order; i++)
    {
        q[i] += qSection[j][i];
    }
}

private static void MultiplySection(int startIndex, int endIndex, double[] d, List<double> A, List<int> J, int[] I, double[] q)
{
    for (int i = startIndex; i < endIndex; i++)
    {
        int indexFirst = I[i];

        q[i] += A[indexFirst]*d[J[indexFirst]];

        for (int j = indexFirst + 1; j < I[i + 1]; j++)
        {
            int col = J[j];
            double a = A[j];

            q[i] += a*d[col];

            q[col] += a*d[i];
        }
    }
}

Solution

  • Your current implementation is coarse grained, so if a CPU core is busy with something else you risk having poor utilization at the end of your process.

    I would at least try using the built in partitioning by Parallel.For. This should give a much fine grained concurrency that might help with efficiency. You can also use localInit/localFinally methods to create and merge your arrays:

    public static double[] Multiplication(int order, double[] d, List<double> A, List<int> J, int[] I)
    {
        double[] qTotal = new double[order];
    
        Parallel.For(0,
            order,
            () => new double[order], // LocalInit
            (i, loopState, q) => // Main Body
            {
                int indexFirst = I[i];
    
                q[i] += A[indexFirst] * d[J[indexFirst]];
    
                for (int j = indexFirst + 1; j < I[i + 1]; j++)
                {
                    int col = J[j];
                    double a = A[j];
    
                    q[i] += a * d[col];
    
                    q[col] += a * d[i];
                }
                return q;
            },
            q => // Local Finally 
            {
                lock (qTotal)
                {
                    for (int i = 0; i < q.Length; i++)
                    {
                        qTotal[i] += q[i];
                    }
                }
            });
        return qTotal;
    }
    

    If the amount of work inside the loop body is small you might still benefit from batching, but you might want to use a fixed batch size instead of basing it of the number of processors. This should help a bit with various types of overheads. You will likely need to experiment a bit to find your optimal batch size.