- DelphiTools - https://www.delphitools.info -

Once upon a time in a thread…

Last episode in the TMonitor saga. In the previous episode, Chris Rolliston posted a more complete test case [1], for which he got surprising results (including that a Critical Section approach wouldn’t scale with the thread count). Starting from  his code I initially also got similar surprising results.

edit: apparently the “crash” part of the TMonitor issues have been acknowledged by the powers that be [2], and a hotfix could be on the way, though it points back to QC 78415 [3], an issue reported in 2009, ouch. Guess those 4 bytes per instance haven’t seen much use…

Revised Test with Stable Results

I simplified his code (see below), by dropping the usage of several RTL classes and features, and went for a straightforward implementation, in the process, the oddities went away as far as Critical Section is concerned, and partially so as far as TMonitor goes…
The results can be summarized by this chart:

This was measured on a quad-core, as you can see the Critical Section version stays flat until the number of threads gets greater than the core count, at which point, there is a small ramping arising from the workload taking its toll. TMonitor is a different story, if the revised test doesn’t exhibit the poor scaling I was finding in my previous test, there is still a ramping,  as well as a wild jump once there are more threads than cores.

Which RTL class or what exactly was the source of the behavior in Chris’s original code, I don’t know. One possible cause pointed by Krystian in a former comment could be that instances can end up in the same cache line, though that doesn’t explain everything, it could be a start is major factor.

Note that TMonitor allocates its own small block for its locking purposes, distinct from the object instance, and AFAICT there are no provisions in case those blocks end up in the same cache line, though I’m not convinced yet that’s the issue we’re seeing here, this could be a source of contention.

edit: Krystian posted some sample code with cache-line collision avoidance, with it TMonitor becomes much more linear, though half as fast as a CS, and there are still occasional spurious slowdowns showing up in the timings.

Test Code

Here is the test code used for the above, if you test on your machine, make sure you have selected the high performance profile in Windows Power options, and that you don’t have any implicit affinity settings kicking in on the executable.

You can call the above code from a form where you’ll have dropped a TMemo to use as log, as I’m assuming you don’t want to slum it in a command line executable 😉

const
   cCountdownFrom = $FFFFFF; //increase if necessary...
   cMaxThreads = 10;

type
   TTestThread = class(TThread);

   TTestThreadClass = class of TTestThread;

   TCriticalSectionThread = class(TTestThread)
      FCriticalSection: TRTLCriticalSection;
      procedure Execute; override;
   end;

   TMonitorThread = class(TTestThread)
      procedure Execute; override;
   end;

procedure RunTest(log : TStrings; const testName : String; threadCount : Integer;
                  threadClass : TTestThreadClass);
var
   i : Integer;
   threads : array of TThread;
   tstop, tstart, freq : Int64;
begin
   SetLength(threads, threadCount);

   for i:=0 to threadCount-1 do
      threads[i]:=threadClass.Create(True);

   QueryPerformanceCounter(tstart);

   for i:=0 to threadCount-1 do
      threads[i].Start;
   for i:=0 to threadCount-1 do
      threads[i].WaitFor;

   QueryPerformanceCounter(tstop);
   QueryPerformanceFrequency(freq);

   log.Add(Format('%s: %d thread(s) took %.1f ms',
                  [testName, threadCount, (tstop-tstart)*1000/freq]));

   for i:=0 to threadCount-1 do
      threads[i].Free;
end;

procedure TCriticalSectionThread.Execute;
var
   counter : Integer;
begin
   InitializeCriticalSection(FCriticalSection);

   counter:=cCountdownFrom;
   repeat
      EnterCriticalSection(FCriticalSection);
      try
         Dec(counter);
      finally
         LeaveCriticalSection(FCriticalSection);
      end;
   until counter<=0;

   DeleteCriticalSection(FCriticalSection);
end;

procedure TMonitorThread.Execute;
var
   counter : Integer;
begin
   counter:=cCountdownFrom;
   repeat
      System.TMonitor.Enter(Self);
      try
         Dec(counter);
      finally
         System.TMonitor.Exit(Self);
      end;
   until counter<=0;
end;

procedure RevisedChrisTest(log : TStrings);
var
   i, j : Integer;
begin
   for i:=1 to 3 do begin
      log.Add('*** ROUND '+IntToStr(i)+' ***');
      for j:=1 to cMaxThreads do begin
         RunTest(log, 'TCriticalSection', j, TCriticalSectionThread);
         RunTest(log, 'TMonitor', j, TMonitorThread);
      end;
   end;
end;