TMonitor woes

Primoz Gabrijelcic recently reported a possible bug with TMonitor, in the more advanced side of TMonitor.

However, when experimenting with it for DWS, I bumped on issues in the basic usage scenarios too, and reverted to using critical sections. It seems that as of Delphi XE, short of a patch, TMonitor is just a waste of 4 bytes per object instance.

One of the basic usage of TMonitor I’m referring to would be to replace a critical section:

TMonitor.Enter(someObject);
try
   // protected code
finally
   TMonitor.Exit(someObject);
end;

However, TMonitor is having a problem with the above, and you can quickly run into situations where everything gets serialized, even when there is no need to. Let’s look at a minimal thread:

type
   TMyThread = class(TThread)
      Count : Integer;
      Obj : TObject;
      procedure Execute; override;
   end;

procedure TMyThread.Execute;
begin
   while not Terminated do begin
      System.TMonitor.Enter(Obj);
      try
         Dec(Count); // or do something else
         if Count<=0 then Break;
      finally
         System.TMonitor.Exit(Obj);
      end;
   end;
end;

Assuming you create two instances of the above thread class, which are working on two different “Obj” instances, the two threads should be able to run in parallel, as they don’t operate on the same memory structures at all, right?
Well, if you use plain old critical sections, they will, but if you use TMonitor like in the above code, they won’t, they’ll just run serialized, and all but the first thread will suffer from severe contention, which hints that a race condition is hiding somewhere in TMonitor’s code…

17 thoughts on “TMonitor woes

  1. I think you are wrong.
    I just took your example thread and created 4 of them with 4 different TObject instances and they ran parallel. Then I changed the Enter/Exit to work on some global instance and they ran serialized.

  2. Having read the source code I must say that I would not be surprised to see bugs in this class.

  3. @Stefan Glienke
    Here is some test code that reproduces the issue

    var
       t1, t2 : TMyThread;
       dt : TDateTime;
    begin
       t1:=TMyThread.Create(True);
       t1.Obj:=TObject.Create;
       t1.Count:=1000000;
    
       t2:=TMyThread.Create(True);
       t2.Obj:=TObject.Create;
       t2.Count:=t1.Count;
    
       dt:=Now;
    
       t1.Start;
       t2.Start;
    
       t1.WaitFor;
       t2.WaitFor;
    
       Caption:=Format('%.3f ms', [(Now-dt)*86400*1000]);
    end;

    A single thread runs in 70 ms on the quad-core here, two threads run in 140 ms, 3 threads in 210 ms, etc.

  4. I think that it’s not related to Delphi directly, TMonitor is using InterlockedCompareExchange (asm LOCK instr. used), then processor must do some kind of cache synchronization between all of processors, which in that case is most time consuming.

  5. @Eric
    Seriously? Execution time as a proof that they don’t run parallel? This is so wrong… As Krystian mentioned most of the time is used for overhead. Put some logging inside the TMonitor Enter/Exit block and you will see that they run parallel.

  6. @Stefan Glienke
    Execution time is the one that becomes visible in 2 lines of code, if you look into the execution details, as I wrote, you’ll also find thread contention.
    If there was no race conditions, but merely serialization, all the threads would run alternatively, this is not the case with TMonitor.

    As for logging as proof of parallelism, erm… what kind of logging code do you use?

  7. @Eric
    Using SmartInspect EnterMethod/LeaveMethod inside the TMonitor Enter/Exit block and you can see that both EnterMethods are logged before both LeaveMethods are logged.

  8. @Eric
    In fact you only need this and you see that the second TMonitor.Enter is not blocked:

    procedure TMyThread.Execute;
    begin
    begin
    System.TMonitor.Enter(Obj);
    try
    OutputDebugString(PWideChar(‘-> Execute ‘ + IntToStr(ThreadID)));
    Sleep(1000);
    finally
    OutputDebugString(PWideChar(‘<- Execute ' + IntToStr(ThreadID)));
    System.TMonitor.Exit(Obj);
    end;
    end;
    end;

  9. @Stefan Glienke
    You’re aware that OutputDebugString prevents normal thread execution as it has to be handled by the *single* debugger thread?

    I don’t know what SmartInspect uses in its EnterMethod/LeaveMethod implementations, but they likely aren’t innocuous either.

    I asked about your logging methods, because all common logging approaches are invalid when it comes to observing race conditions and thread contention. You need a per-thread dedicated logger to observe the effect without (too much) affecting the behavior, with timestamped-entries, and post-thread-execution log ordering, or you need something completely different in the form of a sampling-based logger f.i.

  10. @Eric
    Well apart from logging if you say those threads are serialized. How long does it take to execute both with Sleep(1000) in them? 2 seconds? No, it takes around 1 second.

  11. @Stefan Glienke
    In case you’re not aware, Sleep() relinquishes a thread’s execution time, so by doing a “sleep” in your threads, you’re not testing the thread being busy, but something else (which is the OS being able to wake up a sleeping thread after a set time has passed).
    It effectively hides all threading issues under the carpet, since with 1 lock per second, you just won’t see no thread contention or race conditions.

  12. *grabs some popcorn*

    You really ougtta quit while you’re behind, Stefan. Eric’s the guy who wrote Sampling Profiler. That pretty much by definition makes him an expert on the Windows threading model…

  13. @Mason Wheeler
    True, I am no expert on threading and this might be funny for you showing my knowledgeless here 😉 Anyway even experts can be wrong. But simply saying: hey, this takes double the time, it’s not running parallel is no proof. In my case the numbers for his example are jumping between 109ms and 251ms. If I remove the TMonitor the are between 31ms and 79ms.

  14. @Eric
    I still think that your test gives a ‘false’ results, because you are creating two small objects instances one after one which probably FastMM puts in same 64 bytes boundary which on x64 is processor cache size line (on x86 it’s 32 bytes), so access to that memory from two processors in interlocked (or even regular?) opetation must be synchonized.
    Try make objects larger than 64 bytes.
    See this small test app:
    kblib.googlecode.com/files/LOCKtest.zip

  15. @Krystian
    The TMonitor logic allocates a small memory block for its own purposes, distinct from the object (where the 4 bytes are used to point to that memory block), it’s a small memory block indeed, and thus vulnerable to the issue you refer to.
    Though AFAICT, that’s yet another design issue of TMonitor, I’ve made a few tests allocating small buffers of the same same size to make sure they weren’t in the same cache line, yet, the issue still happened…

    @Stefan Glienke
    I’m not seeing anywhere near such a variability in execution times, are you sure you’re running under a high performance profile, with speedstep, cool’n’quiet, etc. disabled?

Comments are closed.