DelphiTools Profiling multi-threaded applications

SamplingProfiler ^[1] has a few options to help profile multi-threaded applications, which I’ll go over here.

In the current version, those options allow identifying CPU-related bottlenecks, as in “threads taking too much CPU resources or execution time”. However, they do not yet provide many clues to pinpoint bottlenecks arising from thread synchronization issues or serialization (insufficient parallelism). Hopefully, more support for profiling multi-threaded applications will come in future versions.

Single-threaded profiling

By default, SamplingProfiler only looks at one thread, the main application thread, but you can manually (and dynamically) specify another thread. This is done via OutputDebugString (see Control sampling from your code ^[2]):

OutputDebugString('SAMPLING THREAD threadID');

where threadID is the thread ID (as returned, for instance, by the WinAPI function GetCurrentThreadId).

If you specify an invalid threadID, or if the thread dies, no more samples will be collected until you specify a new thread or “return” the sampling focus to the main thread, which can be accomplished with:

OutputDebugString('SAMPLING THREAD 0');
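
As a minimal sketch of how a worker thread might request the sampling focus for itself and hand it back when it finishes: the unit name, the TProfiledWorker class and its DoWork method are hypothetical, and the thread ID is assumed to be expected in decimal form, which is not confirmed here.

```pascal
unit ProfiledWorker;

interface

uses
  Windows, Classes;

type
  // Hypothetical worker thread, for illustration only
  TProfiledWorker = class(TThread)
  protected
    procedure DoWork; virtual; abstract;  // placeholder for the real workload
    procedure Execute; override;
  end;

implementation

uses
  SysUtils;

procedure TProfiledWorker.Execute;
var
  cmd: string;
begin
  // Assumption: the profiler expects the thread ID in decimal form
  cmd := 'SAMPLING THREAD ' + IntToStr(GetCurrentThreadId);
  OutputDebugString(PChar(cmd));
  try
    DoWork;
  finally
    // Return the sampling focus to the main thread before this thread ends,
    // so that sampling does not stall on a dead thread
    OutputDebugString('SAMPLING THREAD 0');
  end;
end;

end.
```

The try/finally block is a design choice for the sketch: it returns the focus even if the workload raises an exception.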

This command is mostly useful if you already have a clue which thread is proving troublesome, such as a worker thread used by a GUI. If you have several worker threads in a thread pool ^[3], which serve random workloads (or workloads assumed to be random), you can pick one of those threads (at random) and have it profiled.

However, this involves a fair amount of bias and guessing about where the bottleneck could be, and is not really applicable if you have a high number of threads working (or sleeping) simultaneously on multiple CPUs. This is where the next mode comes in…

Monte-Carlo Samples Gathering

Monte-Carlo ^[4] sampling is specified via the samples gathering mode option. When it is set, SamplingProfiler will pick a random thread of the profiled application at each sampling, and use it for the sample. Bias and guessing are eliminated.

The good news is that with this method, the sampling load is not increased, and its impact is random: concurrency issues and UI bottlenecks can still be spotted. Hot-spots in a server running at production speed can be spotted too.

The bad news is that if you have a high number of inactive threads, you’ll have to gather more samples to get meaningful results on the active threads (as each time an inactive thread is picked at random, the sample will be meaningless, and thus lost).
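
As a rough worked estimate of that cost, assume each thread is equally likely to be picked and that a steady number A of the T threads is active at any moment. Both are simplifications for illustration, not documented properties of the tool.

```latex
\[
P(\text{useful sample}) = \frac{A}{T},
\qquad
N_{\text{total}} \approx N_{\text{useful}} \cdot \frac{T}{A}
\]
```

With 4 active threads out of 40, about one sample in ten is usable, so gathering 1,000 meaningful samples would take roughly 10,000 sampling attempts under these assumptions.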

Interpreting the profiling results can, however, be a little more difficult, as several multi-threading effects can come into play, for instance a drop in CPU cache efficiency (code stressed in highly threaded situations can behave quite differently from how it behaves when stressed in a single-threaded situation). This will be food for future articles.

To decide if a thread is active or not, SamplingProfiler looks at its registers: if all the registers are unchanged between two samples, the thread is deemed inactive and the sample dropped.

Inactivity can thus result from the thread sleeping or waiting on some event, or just from not having gotten its share of CPU time since the last time it was sampled. This can be quite common if you have many more threads than CPU cores, even if all the threads are busy.
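
The activity test described above can be sketched as a plain comparison of two register snapshots. This is only an illustration of the idea, not SamplingProfiler's actual code: the TRegisterSnapshot record (a 32-bit x86 register subset) and the LooksInactive function are hypothetical names.

```pascal
program InactivityCheck;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical snapshot of 32-bit x86 registers; not the profiler's structure
  TRegisterSnapshot = record
    Eip, Esp, Ebp, Eax, Ebx, Ecx, Edx, Esi, Edi: Cardinal;
  end;

// True when no register changed between two consecutive samples
function LooksInactive(const Previous, Current: TRegisterSnapshot): Boolean;
begin
  Result := CompareMem(@Previous, @Current, SizeOf(TRegisterSnapshot));
end;

var
  a, b: TRegisterSnapshot;
begin
  FillChar(a, SizeOf(a), 0);
  a.Eip := $00401000;
  b := a;
  // Identical snapshots: the sample would be dropped
  Writeln(LooksInactive(a, b));
  b.Eip := $00401004;
  // The instruction pointer moved: the sample would be kept
  Writeln(LooksInactive(a, b));
end.
```

Note that a busy thread that was simply not scheduled between two samples also yields identical snapshots, which is why such a test cannot tell a sleeping thread from a starved one.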

CPU Affinity

The last set of options is the one for processor affinities ^[5]. You can choose which CPUs SamplingProfiler is constrained to, and which CPUs the profiled application is constrained to.

Affinities can be used either to further isolate the profiled application from the profiler, or to easily simulate your application running on a machine with fewer cores. In more advanced scenarios, if you have enough CPU cores, you can also leave some CPU cores entirely unused by both the profiler and the profiled application, and thus reserve them for a third application (such as a database server).
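
For reference, here is a minimal sketch of constraining a process to a subset of CPUs with the WinAPI affinity functions. It illustrates what an affinity mask is, not how SamplingProfiler applies its options, which is not described here. The program name and the two-CPU mask are arbitrary choices, and the variables use DWORD_PTR as declared in recent Delphi versions (older versions declare these parameters as DWORD).

```pascal
program RestrictToTwoCores;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

var
  processMask, systemMask, newMask: DWORD_PTR;
begin
  // Query which CPUs the process and the system currently allow
  if not GetProcessAffinityMask(GetCurrentProcess, processMask, systemMask) then
    RaiseLastOSError;

  // Bits 0 and 1 select the first two logical CPUs;
  // keep only those the system actually exposes
  newMask := systemMask and $3;
  if newMask = 0 then
    raise Exception.Create('CPUs 0 and 1 are not available to this process');

  // Constrain the current process to the selected CPUs
  if not SetProcessAffinityMask(GetCurrentProcess, newMask) then
    RaiseLastOSError;

  Writeln('Process constrained to the selected CPUs');
end.
```

Each bit of the mask stands for one logical CPU, so simulating a machine with fewer cores amounts to clearing bits.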