Parallel write to array slower than serial write using OmniThreadLibrary

I am working on an implementation of the Differential Evolution optimization algorithm, and want to speed up the calculation time by calculating population members in parallel. I am using the OmniThread library, and have successfully parallelized my loop, only to find that it runs SLOWER than the serial implementation.

I have reduced the code to its essence to test the parallelization, and the reduced version exhibits the same problem: the parallel version is slower than the serial one.

The key is that I pass multiple dynamic arrays to which output should be written for each member of the population. Each array has one of the dimensions dedicated to the population member, so for each population member a different set of array indices is accessed. This also means that in the parallel implementation no 2 threads will write to the same array element.

Below the code I used to test (the actual code in the Differential Evolution has a DoWork procedure with even more const parameters and var arrays)

unit Unit1;

interface

type
  TGoalFunction = reference to function(const X, B: array of extended): extended;
  TArrayExtended1D = array of extended;
  TArrayExtended2D = array of TArrayExtended1D;

  TClassToTest = class abstract
  private
    class procedure DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex, AIndex2: integer);
  public
    class procedure RunSerial;
    class procedure RunParallel;
  end;

function HyperSphere(const X, B: array of extended): extended;

const
  DIMENSION1 = 5000;
  DIMENSION2 = 5000;
  LOOPS = 10;

implementation

uses
  OtlParallel;

function HyperSphere(const X, B: array of extended): extended;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to Length(X) - 1 do
    Result := Result + X[I]*X[I];
end;

{ TClassToTest }

class procedure TClassToTest.DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex, AIndex2: integer);
var
  I: Integer;
begin
  AOutputArray1[AIndex] := AGoalFunction(AInputArray[AIndex], []);
  for I := 0 to Length(AOutputArray2[AIndex]) - 1 do
    AOutputArray2[AIndex, I] := Random*AIndex2;
end;

class procedure TClassToTest.RunParallel;
var
  LGoalFunction: TGoalFunction;
  LInputArray: TArrayExtended2D;
  LOutputArray1: TArrayExtended1D;
  LOutputArray2: TArrayExtended2D;
  I, J, K: Integer;
begin
  SetLength(LInputArray, DIMENSION1, DIMENSION2);
  for I := 0 to DIMENSION1 - 1 do
  begin
    for J := 0 to DIMENSION2 - 1 do
      LInputArray[I, J] := Random;
  end;
  SetLength(LOutputArray1, DIMENSION1);
  SetLength(LOutputArray2, DIMENSION1, DIMENSION2);

  LGoalFunction := HyperSphere;

  for I := 0 to LOOPS - 1 do
  begin
    Parallel.ForEach(0, DIMENSION1 - 1).Execute(
      procedure (const value: integer)
      begin
        DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, value, I);
      end
    );

    for J := 0 to DIMENSION1 - 1 do
    begin
      for K := 0 to DIMENSION2 - 1 do
        LInputArray[J, K] := LOutputArray2[J, K];
    end;
  end;
end;

class procedure TClassToTest.RunSerial;
var
  LGoalFunction: TGoalFunction;
  LInputArray: TArrayExtended2D;
  LOutputArray1: TArrayExtended1D;
  LOutputArray2: TArrayExtended2D;
  I, J, K: Integer;
begin
  SetLength(LInputArray, DIMENSION1, DIMENSION2);
  for I := 0 to DIMENSION1 - 1 do
  begin
    for J := 0 to DIMENSION2 - 1 do
      LInputArray[I, J] := Random;
  end;
  SetLength(LOutputArray1, DIMENSION1);
  SetLength(LOutputArray2, DIMENSION1, DIMENSION2);

  LGoalFunction := HyperSphere;

  for I := 0 to LOOPS - 1 do
  begin
    for J := 0 to DIMENSION1 - 1 do
    begin
      DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, J, I);
    end;

    for J := 0 to DIMENSION1 - 1 do
    begin
      for K := 0 to DIMENSION2 - 1 do
        LInputArray[J, K] := LOutputArray2[J, K];
    end;
  end;
end;

end.

I was expecting a speedup of around x6 on my 8-core processor, but was faced with a slight slowdown. What should I change to get the speedup from running the DoWork procedure in parallel?

Note that I'd prefer to keep the actual work in the DoWork procedure, since I have to be able to call the same algorithm with and without parallelization (boolean flag) while keeping the body of the code shared for easy maintenance

Solution

This is due to the lack of thread safety of Random. The implementation of which is:

// global var
var
  RandSeed: Longint = 0;    { Base for random number generator }

function Random: Extended;
const
  two2neg32: double = ((1.0/$10000) / $10000);  // 2^-32
var
  Temp: Longint;
  F: Extended;
begin
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F  := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;

Because RandSeed is a global variable, which is modified by a call to Random, the threads end up having contended writes to RandSeed. And those contended writes cause your performance problem. They effectively serialize your parallel code. Severely enough to make it slower than the true serial code.

Add the code below to the top of the implementation section of your unit and you'll see the difference:

threadvar
  RandSeed: Longint;

function Random: Double;
const
  two2neg32: double = ((1.0/$10000) / $10000);  // 2^-32
var
  Temp: Longint;
  F: Double;
begin
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F  := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;

With that change to avoid shared, contended writes, you'll find that the parallel version is faster, as expected. You don't get linear scaling with processor count. My guess is that is because your pattern of memory access is sub-optimal in the parallel version of the code.

I'm guessing that you are only using Random as a means to generate some data. But if you do need an RNG, you'll want to arrange that each task uses their own private instance of an RNG.

You can also speed up your code a little using Sqr(X) rather than X*X, and also by switching to Double instead of Extended.