Delphi array constructors performance (or lack of)

Previous: Diagnostic.

Overhead in detail

The final outcome is that using the Create magic constructor can incur quite a bit of overhead:

  • a DynArrayClear call (not sure why it’s there), that will release the previously assigned block of memory for the temporary array
  • a DynArraySetLength, that will allocate a new block of memory and zero it
  • a DynArrayAssign, that will trigger the release of the memory for the existing array (if it wasn’t empty), along with a bus lock for the reference count overhead
  • extra initialization and finalization for the temporary array
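For reference, the form that triggers all of the above is the magic constructor call (assuming a `TIntegerArray = array of Integer;` type declaration, as in the earlier snippets):

```pascal
var a : TIntegerArray;
...
// compiles down to DynArrayClear + DynArraySetLength + DynArrayAssign
// on a hidden temporary array
a := TIntegerArray.Create(1, 2);
```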

In multi-threaded applications, all that extra memory management and bus locking has a disproportionate impact on performance. If you test the above snippets in a multi-threaded environment, you’ll notice that when using the array constructor, execution quickly becomes effectively single-threaded, bottlenecking on the memory manager and the bus locks.

The manual initialization only has a single DynArraySetLength call, and if the array is not empty, this may not result in a new block being allocated (as the existing memory block could just be resized in place). So if you initialize the same array variable more than once, the manual form can be quite cheap.
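For comparison, the manual form referred to above is just a SetLength followed by direct element assignments:

```pascal
SetLength(a, 2);   // single DynArraySetLength call, may resize in place
a[0] := 1;
a[1] := 2;
```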

A better array initializer?

Now that I have shown you the magic array Create constructor is no good, what if you still want something convenient? Well, open arrays can come to the rescue:

procedure InitializeArray(var a : TIntegerArray; const values : array of Integer);
begin
   SetLength(a, Length(values));
   if Length(values)>0 then
      Move(values[0], a[0], Length(values)*SizeOf(Integer));
end;
...
InitializeArray(a, [1, 2]);

The above function won’t be as efficient as manual initialization: there is an extra function call and the values will be copied twice. However, it eliminates all the extra memory management and bus locking, so it will scale much better in multi-threaded code, while remaining compact both syntax-wise and code-wise.

Note that for a managed type (String, Interface…), System.Move can’t be used: you’ll need either asm hackery or a for-to-do loop with item-by-item assignment, which will incur a performance hit, and often makes it non-competitive with the manual version.
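For instance, a managed-type variant (assuming a hypothetical `TStringArray = array of String;` type) would look like this, with the loop letting the compiler emit the proper reference-counted assignments:

```pascal
procedure InitializeStringArray(var a : TStringArray; const values : array of String);
var
   i : Integer;
begin
   SetLength(a, Length(values));
   for i := 0 to High(values) do
      a[i] := values[i];   // reference-counted assignment, no Move allowed here
end;
```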

Need even more speed?

In the grand scheme of things however, all the above approaches suffer from the SetLength call, which is quite complex (have a look at DynArraySetLength in the System.pas unit… and weep). So if you know there is a chance the dynamic array already has the right length, in the manual version you can gain by doing:

if Length(a)<>Length(values) then
   SetLength(a, Length(values));
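Combining the guard with the open-array helper from earlier gives a version that skips SetLength when the length already matches (a sketch, reusing the hypothetical InitializeArray signature):

```pascal
procedure InitializeArrayFast(var a : TIntegerArray; const values : array of Integer);
begin
   // note: skipping SetLength also skips its copy-on-resize uniqueness
   // guarantee, so this assumes a is not shared with another variable
   if Length(a)<>Length(values) then
      SetLength(a, Length(values));   // only pay for DynArraySetLength when needed
   if Length(values)>0 then
      Move(values[0], a[0], Length(values)*SizeOf(Integer));
end;
```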

When the SetLength call is waived, this can net you a mind-boggling 10x speedup (ten times).

Ouch! Why doesn’t the RTL do that?

Well, it doesn’t do that because it can’t: Delphi’s dynamic arrays are a kind of hybrid, halfway between a value type and a reference type, and SetLength is the keystone where all the hybridization happens (for more on the subject, see Dynamic Arrays as Reference or Value Type).

And FWIW, in DWScript, arrays are first-class reference types, which means they can have more capabilities, and their initialization syntax is also more compact; the above initialization is just:

a := [1, 2];

And if you’re using Smart Pascal and running it in Chrome V8 or node.js, well, let’s just say you’ll need all of the above tricks for Delphi to come out ahead performance-wise.

8 thoughts on “Delphi array constructors performance (or lack of)”

  1. The DynArrayClear call should be an optimization. Without it, the subsequent DynArraySetLength call would copy the old `a` contents to a new location if `a <> nil`.

  2. Take a look at how DynArraySetLength() in System.pas is implemented.

    Checking Length(a)<>Length(Value) will not be much faster, since you will just bypass a simple RTTI lookup, some adds, one mul and one div, then a call to ReallocMem, which is a no-op for FastMM4 if the length is the same. And in most cases, Length(a) is indeed not equal to Length(Value), so your optimization will slow everything down.

    Dynamic arrays are reference counted values, with “Copy On Write” pattern.

    DynArrayClear is there to handle the reference counting of any previous dynamic array instance stored in the variable. Here, your local variable is nil, so it will return with no delay. But if a were not nil, it would release any previous instance (if not referenced anywhere else).

    This is exactly the same for a string, a variant or an interface (or a record containing any of those types).

    What is slow with dynamic arrays is not the initialization, but the change of size. If you use SetLength(a,length(a)+1) to add items, it will be dead slow.
    This is why, in our TDynArray wrapper, we allow defining an external “Count: integer” variable, so the array length is not the actual item count, but the array capacity.
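    The external-Count pattern described above can be sketched like this (an illustrative sketch, not the actual TDynArray code):

```pascal
procedure Append(var a : TIntegerArray; var count : Integer; value : Integer);
begin
   if count = Length(a) then
      SetLength(a, 4 + count*2);   // grow capacity geometrically: amortized O(1) appends
   a[count] := value;
   Inc(count);
end;
```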

    In our enhanced RTL for Delphi 7 and 2007, we rewrote some part of the DynArraySetLength() low-level function. But speed increase is not huge.

    Note that DynArraySetLength() current implementation in System.pas is not thread-safe with regard to the reference counting (whereas it is thread-safe, e.g., for strings since Delphi 5).

    What may render dynamic arrays faster in Delphi would be the emission of code instead of using the RTTI (similarly for record copies). Up to now, an optimization exists if there is no RTTI for records, but dynamic arrays, or records containing reference-counted types, use RTTI, which is slower. By the way, this is similar to the reason why TValue/variant types are slower than some optimized versions (like TOmniValue) using proper inlining features, which allow some compile-time optimization, bypassing RTTI.

  3. A. Bouchez :

    Checking Length(a)<>Length(Value) will not be much faster

    This is not what benchmarking reports; the 10x speedup is easy to observe in that case.

    And in most cases, Length(a) is indeed not equal to length(Value), so your optimization will slow down everything.

    A simple comparison like that one, when correctly branch-predicted, is very cheap; you’ll be hard-pressed to measure a difference in benchmarks.
    And if it’s not correctly branch-predicted, then it means you’re in the same-size situation, and you will avoid the branch mispredicts that happen inside DynArraySetLength.

    Dynamic arrays are reference counted values, with “Copy On Write” pattern.

    They’re not: Strings are copy-on-write; dynamic arrays are just reference-counted types, which happen to be duplicated by SetLength when there is more than one reference to them. They’re hybrids.
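    The hybrid behavior can be illustrated like this (assuming the TIntegerArray type from the article):

```pascal
var a, b : TIntegerArray;
begin
   SetLength(a, 2);
   a[0] := 1;
   b := a;            // b references the same block, refcount = 2
   b[0] := 10;        // no copy on write: a[0] is now 10 as well
   SetLength(b, 2);   // same length, but SetLength duplicates since refcount > 1
   b[0] := 20;        // a[0] stays 10
end;
```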

    Here, your local variable is nil, so it will return with no delay.

    It doesn’t in the real world. The reason is that this test is hard to branch-predict: from the POV of the CPU it’s close to random, as you have a mix of nil and non-nil arrays which all go through the same test. And a branch mispredict is very costly on modern architectures.

    What is slow with dynamic arrays is not the initialization, but the change of size. If you use SetLength(a,length(a)+1) to add items, it will be dead slow.

    It’s dead-slow even when the size isn’t changed, benchmark it if you doubt it, the RTL source doesn’t give the whole story.

    Btw, the gains in your DynArray wrapper with a capacity come from that aspect as well.

    When rewriting DynArraySetLength, you don’t get around the branch mispredictions (since there is still only one routine, with only one location for the jumps), but with your wrapper, there are multiple source locations for the Count comparison, which means they are a bit more context-specific, and thus better branch predicted.

    Branch misprediction can be spotted when profiling: if the most expensive line is an innocuous-looking trivial test, chances are this is a misprediction. Normal bottlenecks are arithmetic, read/write from memory, complex calls, etc. When a simple boolean test comes on top, it’s because it’s mispredicted.

    What may render dynamic arrays faster in Delphi should be the emission of code instead of using the RTTI (similar for record copy).

    Yes, this was reported way back in D5 or D7, alongside being able to have custom constructors/destructors for records, rather than the RTTI-based RTL ones (which are structurally slow).

    By the way, this is a reason similar to why TValue/variant types are slower

    You may want to check this 🙂
    http://www.cromis.net/blog/2013/02/tanyvalue-an-attempt-to-make-the-best-variable-data-container/

  4. You are right: they are not COW values, just reference counted. But not 100% thread safe!

    Of course, in your case, checking Length() before SetLength() will be 10x faster, but I suspect this case won’t exist in real-world code, unless your code is very badly written.

    The “a” value is always initialized to nil, during the function/method local stack initialization, just like any reference-counted variable. Check the generated asm code by the compiler.

    In all cases, branch misprediction is nothing in comparison to a ReallocMem(), an InitializeArray() or a Move(), as run within DynArraySetLength(). TDynArray is not much faster because of branch prediction and CPU cache issues, but because it bypasses the reallocation. This is what a profiler (like your great sampling profiler) shows on real code, not on dedicated benchmarks like calling SetLength() in a loop. Then accessing a dynamic array is an O(1) opcode, just like any array.

    I know the link in cromis.net – I just did not want to report the whole story.

    DWS array processing is very good. But I still miss auto-detection of optimized HTML5 byte/ord/integer/int64 arrays in the JavaScript compilation, which is not yet the case. An “array of integer” will be slower than with Delphi, I suspect, in terms of memory use and data access.

  5. A. Bouchez :

    DWS array processing is very good. But I still miss auto-detection of optimized HTML5 byte/ord/integer/int64 arrays in the JavaScript compilation, which is not yet the case. An “array of integer” will be slower than with Delphi, I suspect, in terms of memory use and data access.

    I’m monitoring progress with JS typed arrays, but at the moment they’re still slower than regular JS arrays, so they’re only beneficial for memory space (when large) or to talk with WebGL. Arrays benefit from quite a lot of optimizations in Chrome V8.

    With the JS compiler you already avoid the issue of mixed-type arrays (integers & floats f.i.), which are detrimental to V8 optimizations, and the compiler also always emits variable declarations with an initialization (from which V8 infers typing). But there are other things to be aware of, like try..except/finally, which turns off some JITting features.

  6. @Eric
    Thanks for the feedback about V8 inferring typing.

    I did read this some time ago, but it had slipped my mind.

    I was wondering why all the JavaScript code generated by DWS/SMS initializes all arrays. It sounded like something unnecessary, and even worse, code-bloating. But in fact, from the JIT point of view, such initialization is warmly welcome!

    Great work, Eric!

Comments are closed.