multithreading performance sorting delphi utilization

25% efficiency of 4 cores processor with Delphi Threads

Could anyone help me with this problem?

I'm trying to achieve more efficiency that's why I tried to parallel my calculations. After a few tests the results showed me that nothing is faster than calculation on 1 thread. its only 25% of processor load in both cases (1 thread and 4 threads). Could any one know why it's happening? Is there anything I can do to achieve 100% of efficiency (even 90% be better than 25%)?

Below you can see an example code:


ToolsThread = class(TThread)
public
 procedure Execute(); override;
 procedure QuickSortT(var dict: TArray<AnsiString>; iLo, iHi: Integer);
 Procedure QSortT(var dict: TArray<AnsiString>);
 constructor Create();
var
 tab : TArray<AnsiString>;
 tmp1: Longint;
end;

procedure ToolsThread.QuickSortT(var dict: TArray<AnsiString>; iLo, iHi: Integer);
var
 Lo, Hi: Longint;
 Pivot: Pointer;
 T: Pointer;
begin
  Lo := iLo;
  Hi := iHi;
  Pivot := pointer(dict[(Lo + Hi) shr 1]); // shr 1 is slightly faster than div 2;
  repeat
    while dict[Lo] < AnsiString(Pivot) do Inc(Lo);
    while dict[Hi] > AnsiString(Pivot) do Dec(Hi);
    if Lo <= Hi then
    begin
      T := pointer(dict[Lo]);
      pointer(dict[Lo]) := pointer(dict[Hi]);
      pointer(dict[Hi]) := T;
      Inc(Lo) ;
      Dec(Hi) ;
    end;
  until Lo > Hi;
  if Hi > iLo then QuickSort(dict, iLo, Hi) ;
  if Lo < iHi then QuickSort(dict, Lo, iHi) ;
end;

Procedure ToolsThread.QSortT(var dict: TArray<AnsiString>);
begin
 QuickSort(dict, 0, Length(dict)-1);
end;

procedure ToolsThread.Execute();
var
 tmp1, tmp2 : Longint;
 dict: TArray<AnsiString>;
begin
 SetLength(dict, 10000000);
 for tmp1:= 0 to 10000000-1 do
  dict[tmp1] := IntToStr(Random(high(integer)));
 QSortT(dict);
end;

Procedure Main;
var
 Th1, Th2, Th3, Th4: ToolsThread;
begin

 Th1 := ToolsThread.Create();
 Th2 := ToolsThread.Create();
 Th3 := ToolsThread.Create();
 Th4 := ToolsThread.Create();

 debug('Start THR');
 Th1.Start;
 Th2.Start;
 Th3.Start;
 Th4.Start;
 th1.WaitFor;
 th2.WaitFor;
 th3.WaitFor;
 th4.WaitFor;
 debug('THR Done');
end;

Corrected according suggestions. Still 25% CPU load (5-8% per thread)

SOLVED! There is a general problem with some Delphi memory management in multiprocessing. It's not fastMM4 problem and it is resolvable only as workaround for now.

Solution

First of all, use string everywhere. Don't mess with AnsiString and string types - unless you are with a pre-Delphi 2009... And is you use AnsiString, then IntToStr() will return a plain string which will then be reallocated and converted into an AnsiString when put in the array...

From what I can see in your code, my guess is that most of the time is spent into calling IntToStr(), then converting the returned string into AnsiString, which will involve the heap manager, which will be mostly serialized in the process. Of course, it is a quick guess, with my "brain profiler" (tm) which uses to be wrong.

A QuickSort algorithm is O(n*log(n)) which is faster than O(n) calls of IntToStr() from multiple threads.

Try putting only QSortT(dict) in the Execute method. It should be almost linear.

Another guess is that the following lines are not optimal. They will involve reference counting when swapping strings (assigning strings is in fact a call to an hidden _UStrAsg() low function), which takes a lot of CPU power due to a "lock" on the reference counter. If multiple threads share some CPU cache during the "lock" (which may occur in your case, since IntToStr allocations are done in parallel), then this "lock" would create some cache contention, which could take some noticeable time, even on newer CPUs.

The following may be faster (to be verified):

   var T: pointer;
   ...
    if Lo <= Hi then
    begin
      T := pointer(dict[Lo]);
      pointer(dict[Lo]) := pointer(dict[Hi]);
      pointer(dict[Hi]) := T;
      Inc(Lo) ;
      Dec(Hi) ;
    end;

You may also replace pivot with a pointer but it is a bit more tricky:

var pivot: pointer;
...
  repeat
    Pivot := pointer(dict[(Lo + Hi) shr 1]); // within the 2nd repeat..until
    while dict[Lo] < string(Pivot) do Inc(Lo) ;
    while dict[Hi] > string(Pivot) do Dec(Hi) ;

Idea is always to use real profiling in such cases. At least, use precision timer/watch in the code you want to measure, and see where the CPU goes into.

Edit: I have written some code using my advices - see https://gist.github.com/synopse/02eb142d35cb44126aed9fd3200a76d1

output is on a 2 core CPU:

Fill done in 3.25s
Sort done in 24.11s

with one core used during Fill/Create - as expected
with 100% of cores used during Sort/Execute - as expected