Archive

Posts Tagged ‘Compiler’

Delphi XE2-64bit: bottleneck in trigonometric functions?

September 22nd, 2011

Taylor Series and Angle Reduction

In Delphi XE2 64bit, SSE2 is used to compute the trigonometric functions (cos, sin, etc.), and they are computed through what looks like Taylor series (with double-precision literals being coded in hexadecimal, likely to minimize compiler precision issues).

However Taylor series only work for small values, so when you have a large angle value, it has to be reduced in the 0 .. 2PI range, which typically involves a form of floating-point Euclidian division or exponent reduction. For typical SSE2 implementations, this means that computing a trigonometric function for a high angle value is slower, as this reduction has to be performed, typically, you’re looking at something like a 25% slowdown tops.

Bottleneck

That said, here comes iga2iga2 (in the comments) and Ville Krumlinde, which both noticed to a performance issue in Delphi XE2 64bit, especially when facing other compilers. In XE2 64 bit, the reduction is performed through a loop and a fixed-step reduction, which means that the greater the angle value, the slower it gets.

Here are some timings on a sin/cos benchmark (hundreds of thousandths of calls):

Angle value XE2-32 XE2-64
1.0 112 ms 86 ms
100 113 ms 125 ms
1e7 114 ms 3700 ms
1e14 128 ms 7600 ms

Choices, Choices, Choices

But… timing isn’t everything, when computing trigonometry for very large angles, you quickly run into numerical precision issues, and then, you basically have three options:

  • just give up, that’s actually what the FPU does in 32bits, f.i. look at the value of Sin(1e22) in Delphi 32bit, it’s… 1e22. Which is obviously not a valid sine value! And you’ve been living with that potential issue for all your 32bit life…
  • spit out something, anything, under the assumption that if the user went for such an angle, it was garbage, so garbage in, garbage out, no one will notice it, you didn’t see me do it… you can’t prove anything anyway!
  • try to be accurate, damn the timings, damn garbage in, damn the torpedoes, full precision ahead! That’s what XE2-64 is doing. I haven’t checked in details, but XE2 approach seem to be based on this approach: “argument reduction, for huge arguments: good to the last bit“, and it gets Sin(1e22) right.

Just try for Sin(1e22) in your favorite environment, the correct value is -0.8522, Delphi XE2 64bit Gets It Right, where other environments may just flash a bunch of random decimals to fool your eyes.

Update: as pointed by Daniel Bartlett in the comments, the AMD LibM library provides a much faster and similarly accurate implementation of sin/cos and other functions.

So, what gives?

If you’re after raw accuracy, you’ll have to pay for the extra execution cycles to avoid the garbage out. However, chances are, your code doesn’t have anywhere near the numerical accuracy to avoid garbage in, so no matter the precision in the reduction, you’ll still just get garbage out. And if your code was running in 32bit, chances are you had some huge garbage out already, due to the FPU giving up.

If you’re not after accuracy, f.i. if you’re just using sine/cosine for time-based animations, the extra computing precision may bite you, for no benefit, so you’re better off performing the reduction yourself, before calling sin/cos, using whatever low-precision implementation you wish.

In the long run, it might be preferable for Delphi to just adopt the GIGO approach, and keep the high precision implementations for a high precision maths library: in most situations, they won’t avoid GO because of GI, so it might be best to blend with the rest (in benchmarks).

Tips , ,

Happy {$EXCESSPRECISION OFF}!

September 9th, 2011

Just a notice: I’ve updated the XE2 single-precision floating point article after using the (up to now) undocumented {$EXCESSPRECISION OFF} directive, thanks to Allen Bauer for chiming in!

Executive summary: this directives enables use of single-precision SSE floating point instruction by the compiler, and brings their performance in line with expectations, making Delphi XE2 64bit compiler the new King of the Delphi Hill.

Edit: now documented here: Floating point precision control (Delphi for x64). That was fast!

Edit 2: an issue in the compiler was found related to non-explicitly typed constants by Ville Krumlinde, see QC #98753, so be aware of potential incorrect code generation with the initial version of the 64 bits compiler.

Edit 3: if you happen to be one of the error insight users, it will complain that the directive doesn’t exist.

Edit 4: the compiler issue was fixed for Delphi XE2 Update1!

Let’s hope that this directive will be fully supported, and in time face the same fate as $STRINGCHECKS did (ie. become another scary story for the long winter nights).

News , ,

XE2 single-precision floating point (partial) disappointment…

September 5th, 2011

In the previous episode, it appeared that Delphi XE2 64bit compiler was achieving quite good results, however, after further investigations, things may not be so clear-cut. Transcendental maths, which will be food for a another post, the subject of this one seems to be an issue with single-precision floating point maths.

edit: it appeared there is an undocumented {$EXCESSPRECISION OFF} directive which controls the generation of the conversion opcodes hampering single-precisions floating point performance, the articles has been updated. Thanks Allen Bauer, Andreano Lanusse & Leif Uneus for bringing it to attention!

Single precision

Single precision having a smaller memory footprint and being typically processed faster (especially when using SIMD, f.i. SSE allows processing 4 single-precision floats at the same time, while you can process only 2 double-precision floats at a time with SSE2), thus it is often encountered in performance-critical code where precision isn’t essential. One typical such use is for 3D computations, meshes, and geometry.

Most 3D engines out there make heavy use of single-precision floating point (GLScene and thus FireMonkey too), and it’s the primary native float data type expected by most graphics hardware.

Updated benchmark charts

However, the new 64bit compiler doesn’t like single precision floats, while the 32bit compiler likes them, this leads to this interesting chart:

Mandelbrot times (ms), lower is better

XE2 – 32 bits XE2 – 64 bits
Single Precision… 115 257 / 66*
Double Precision… 193 67

There are two figures in the 64bit single precision case, the high figure is what you see if you just compile with optimizations (yes, you’re seeing this right, single precision floating point math in Delphi 64bit behaves worse than double-precision maths in Delphi 32bits!), and the low figure is if you use the undocumented (up until this article)  {$EXCESSPRECISION OFF} directive.

The new XE 64bit compiler can give you the best, and the worst: using single precision floats can make your 64 bits code almost 4 times slower if you don’t turn off “excess precision”, while it can make 32 bits code 70% faster…

Why, oh why?

The reason? The 64bit compiler doesn’t use scalar single precision opcodes if you don’t have “excess precision”, turned off and converts everything back and forth to double precision. Here is a snippet from the CPU view:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1468 F3480F5AC4       cvtss2sd xmm0,xmm4
00000000005A146D F3480F5ACC       cvtss2sd xmm1,xmm4
00000000005A1472 F20F59C1         mulsd xmm0,xmm1
00000000005A1476 F3480F5ACD       cvtss2sd xmm1,xmm5
00000000005A147B F34C0F5AC5       cvtss2sd xmm8,xmm5
00000000005A1480 F2410F59C8       mulsd xmm1,xmm8
00000000005A1485 F20F5CC1         subsd xmm0,xmm1
00000000005A1489 F3480F5ACA       cvtss2sd xmm1,xmm2
00000000005A148E F20F58C1         addsd xmm0,xmm1
00000000005A1492 F2480F5AC0       cvtsd2ss xmm0,xmm0

This is the similar code as for double precision, with loads of cvtss2sd & cvtsd2ss instructions thrown in! No mulss, subss or addss in sight, and yes, you can see redundant stuff happening, 4 lines are doing the actual computation, 6 are doing conversions, and doing them every… single… time.

If you’re a fan of “Lucky Luke“, the first two lines may remind you of a Dalton brothers prison break (even though the brothers are in the same cell, they each dig their own hole to freedom) ;-)

Now if you have the  {$EXCESSPRECISION OFF} directive, you see a different picture, the compiler uses single-precision opcodes as expected:

FMandelTest.pas.194: x := x0 * x0 - y0 * y0 + p;
00000000005A1450 0F28C4           movaps xmm0,xmm4
00000000005A1453 F30F59C4         mulss xmm0,xmm4
00000000005A1457 0F28CD           movaps xmm1,xmm5
00000000005A145A F30F59CD         mulss xmm1,xmm5
00000000005A145E F30F5CC1         subss xmm0,xmm1
00000000005A1462 F30F58C2         addss xmm0,xmm2

As Ville Krumlinde pointed in the comments, VS 2010 C-compiler has the same weird behavior.

I say weird, because if you go to the length of specifying single precision floating point, it’s usually because you mean it, and it’s trivial enough to have an expression be automatically promoted to double-precision by throwing in a Double operand or cast.

This reminds me of the old $STRINGCHECKS directive, which one had to remember to adjust or suffer lower string performance. Hopefully the hand holding will be reversed in the next version, with excess precision being off by default.

News , ,

First look at XE2 floating point performance

September 2nd, 2011

With XE2 now officially out, it’s time for a first look at Delphi XE2 compiler floating point performance (see previous episode).

For a first look I’ll reuse a Mandelbrot benchmark, based on this code Mandelbrot Set in HTML 5 Canvas. What it tests are double-precision floating-point basic operations (add, sub, mult) in a tight loop, there is relatively little in the way of memory accesses (or shouldn’t be, to be more accurate).

You can find the source code see there, it compiles pretty much straight away in XE2 (just comment out  the asm for Win64).

NOTE: when this article was originally posted, I had stumbled upon an XE2 Trial version “trap” (or feature?) which basically deactivated Win64 optimizations as defined through the project options. Kenji Matumoto pointed the issue, and this is an updated article where I used {$O+} in the code to “force” optimizations. The outcome is a *much* prettier picture, I’m happy to say! Reservations from the initial articles are gone, good job Embarcadero!

edit 05/09, after further tests, I’m adding one reservation single-precision floating point doesn’t look so hot. More on the subject there.

Benchmark results

Without further ado, here are the raw figures on my machine for the 480 x 480 case, keep in mind the Delphi versions do NOT use Canvas.Pixels[], but direct memory access in an array:

Execution time in milliseconds, lower is better

Or if you prefer hard figures:

  • Delphi XE2 – 32 bits: 193 ms
  • Delphi XE2 – 64 bits: 67 ms — fastest Delphi
  • Delphi XE: 196 ms
  • FireFox 6: 121 ms
  • Chrome 13: 74 ms
  • (out of competition: XE 32bit hand-made assembly: 57 ms)

So what gives?

  • XE2 32bit compiler still uses the old FPU code, the performance delta with XE is minimal and could just be an alignment issue (pseudo-random, since the compiler doesn’t pro-actively align). Let’s hope the SSE2 codegen will be retrofitted in XE3.
  • XE2 64bit compiler get a nice boost from using SSE2, allowing it to catch up and overtake all JavaScript JITters.
  • Chrome V8 makes a good showing in this benchmark, but loses the crown, native Delphi is back on top!

A peek under the hood

What does the compiler generate for the two following lines?

x := x0 * x0 - y0 * y0 + p;
y := 2 * x0 * y0 + q;

Once you pop up the CPU view, you’ll see:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1452 660F28C4         movapd xmm0,xmm4
00000000005A1456 F20F59C4         mulsd xmm0,xmm4
00000000005A145A 660F28CD         movapd xmm1,xmm5
00000000005A145E F20F59CD         mulsd xmm1,xmm5
00000000005A1462 F20F5CC1         subsd xmm0,xmm1
00000000005A1466 F20F58C2         addsd xmm0,xmm2
FMandelTest.pas.194: y := 2 * x0 * y0 + q;
00000000005A146A 660F28CC         movapd xmm1,xmm4
00000000005A146E F20F590DA2000000 mulsd xmm1,qword ptr [rel $000000a2]
00000000005A1476 F20F59CD         mulsd xmm1,xmm5
00000000005A147A F20F58CB         addsd xmm1,xmm3

And further down the code, the compiler makes use of xmm8, so it’s really aware of the 16 xmm registers you have in x86-64, and finally keeps floating poitn value in registers, something the 32bit compilers (both XE & XE2) don’t do.

To what does it lose to the hand-made asm version? Well a handful of minor things:

  • even though it used up to 9 xmm registers, it didn’t use 10th, leaving some memory access
  • with more careful allocation, it could have fit everything in 8 xmm registers, which would have cut unnecessary traffic
  • it zeroes register with a move from memory,  didn’t do constant unification or propagation.

Still those are mostly nitpickings compared to the massive issues of the old FPU code compilation (which, alas XE2 – Win32 still suffers from).

Conclusion

Support for SSE2 in XE2 64bit compiler consists in a significant step ahead for Delphi floating point performance. XE2 32bit is still same old.

If you’re doing heavy floating point maths, XE2 64bit compiler is a simple ticket to much better performance.

Hopefully in Delphi XE3 they will retrofitting the SSE2 codegen into the 32bit compiler, but ad interim it should quell all the critics about “we don’t need no 64bit”, well, if you do any significant floating-point maths, Delphi XE2 64bit is a must!

News , , ,

What innocuous-looking unit tests can uncover…

August 17th, 2011
Comments Off

I’ve recently been adding DWScript snippets to Rosetta Code, using them as unit tests as well.

Quite a few of Rosetta Code’s tasks consist in mathematical tasks, and I was wondering, how many math tests do you really need?

Well, quite a few! While implementing the Lucas-Lehmer test, it ended up hitting the precision boundary quite sooner than it should theoretically had, given that DWScript’s Integer is actually a 64bit integer.

Some investigations in the CPU view later turned out that the Delphi compiler did not  generate the proper CPU instructions for Sqr() in the case of integers, which DWS was relying upon. Apparently this has been QC’ed many times since Delphi 5, but still exists to this day in Delphi XE. The issue is now worked around in the SVN.

Fixed for XE2? Let’s hope, there may still be time…

 

 

Tips , ,

Spotlight on TdwsSymbolDictionary

February 19th, 2011

A useful, yet not very prominent class of DWScript is TdwsSymbolDictionary. It provides a reference of  all symbols mentioned in your script, where they are declared, where they are used, etc. It is optionally filled up when you compile a script with the coSymbolDictionary compile option, and can be accessed from a compiler program object or interface.

It can be used for multiple purposes, centered around IDE features and auditing. Here are a few usage scenarios.

In-place symbol help and navigation

The most basic use of TdwsSymbolDictionary is probably to figure out which symbol is under the cursor or the mouse in your code editor. You can then use the symbol to provide relevant contextual help (either from your help file, or auto-generated from the script, such as a list of parameters for a function, of fields for a class, etc.) or look up the dictionary once more, and provide information about where it’s declared, implemented, used or forwarded. That positional information can also be used to offer navigation shortcuts between declaration and implementation, or to any of the symbol’s occurrences in the source.

Auto-completion

The TSymbolDictionary can be used it to figure out what is currently being typed, by identifying the symbol at the cursor, or the last recognized symbol in the code being entered. From that point on, you can generate a list of contextual variables, methods, fields, etc. that can be used to fill up an auto-completion drop down.

Code Refactoring

Once you have a symbol pinpointed, either by name or through one of its occurrences in the source, you have access to all the places it’s occurring in the code. A trivial use of such a list is for a rename refactoring: as the list of symbol positions is already sorted by source file, line and column, you merely have to walk the list backwards and replace occurrences in your sources files.

You can also team up TdwsSymbolDictionary with  other information sources to implement more complex refactorings. For instance alongside TdwsSourceContextMap (which is also provided in a compiled program), you can implement “extract method” (or its variants, “pull up” & “push down”).

Code Auditing

The symbol dictionary can be a convenient starting place for symbol-related audits and diagnostics. All audits that relate to the amount of times a symbol is used (or not) being the most trivial.

Another straightforward usage is when you have naming conventions for fields, variables, classes etc. and/or to enforce case consistency. The later can be handled in your IDE in a similar fashion as a rename refactoring, except you only replace in suReference and suImplementation symbol positions, and it is harmless enough you can do it in a background task without requiring user interaction.

You can also use it to diagnose code issues, for instance such patterns:

someObject.someProperty[stuff].otherProperty.function.method(1);
someObject.someProperty[stuff].otherProperty.function.method(2);
someObject.someProperty[stuff].otherProperty.function.method(3);

can be heuristically detected (either as “too many member symbols on the same line” or “member symbol repeated too much in nearby lines”), and the attention brought to the hot-spots so the code can be simplified and cleaned up.

Tips , , ,

The limitations of Delphi’s “inline”

February 8th, 2011

Sometimes, the most simple-looking code can cause the Delphi compiler to stumble.

I bumped on such a case recently, and simplified it to a bare-bones version that still exhibits the issue:

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.Get : Double;
begin
   Result:=Field; // here you could do a computation instead
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet;
end;

Basically all you have are trivial functions that return the value of a floating-point field.

Given the above, for the TMyClass.Get method, the optimal codegen would look just like

fld qword ptr [eax+8]
ret

Simple enough, eh? Yet here is what the Delphi XE compiler generates:

Unit1.pas.326: begin
0053A794 83C4F0           add esp,-$10
Unit1.pas.327: Result:=FRec.Get;
0053A797 83C008           add eax,$08
0053A79A 8B10             mov edx,[eax]
0053A79C 89542408         mov [esp+$08],edx
0053A7A0 8B5004           mov edx,[eax+$04]
0053A7A3 8954240C         mov [esp+$0c],edx
0053A7A7 8B442408         mov eax,[esp+$08]
0053A7AB 890424           mov [esp],eax
0053A7AE 8B44240C         mov eax,[esp+$0c]
0053A7B2 89442404         mov [esp+$04],eax
Unit1.pas.328: end;
0053A7B6 DD0424           fld qword ptr [esp]
0053A7B9 83C410           add esp,$10
0053A7BC C3               ret

for the less-asm fluent, a direct pseudo-pascal translation of the above would be

var
   p : PDouble;
   temp1, temp2 : Double;
begin
   p:=@FRec.Field;
   temp1:=p^;
   temp2:=temp1;
   Result:=temp2;
end;

And if TMyClass.Get is not virtual, but a static method with “inline”, you get the above with a third temp3” Double (ie. it will perform even worse).

The above trips to temporaries aren’t innocuous, because those temporaries are in the stack, and result in stalls as the CPU pipeline waits for the roundtrips to L1 memory cache to happen. In practice, a single of those stalls can take as much time as half a dozen floating operations.

To get rid of the temporaries, there are two options: you can manually inline everything (the RecGet & the Get) to get rid of the temporaries, of course, that doesn’t sit too well with encapsulation, or with virtual calls for that matter.

Or you can use inline-asm instead, a single instruction of asm being enough, and even with calls betweens the functions, it will be running circles around the Delphi compiler’s “inline” output.

Tips , ,

Why no bytecode format?

October 26th, 2010

A compiled script, a TdwsProgram, can’t be saved to a file, and won’t ever be. Why is that?

This is a question that dates back to the first DWS, as it was re-asked recently, I’ll expose the rationale here.

  • DWS has a very fast compiler, it’s no performance problem to compile scripts instead of loading a binary representation that has to be de-serialized. How fast is it? See below.
  • DWS lets you define custom filters, that enable you to encrypt your scripts easily, if hiding the script source is what you were after with the bytecode.
  • DWS compiler/parser portion is quite light (currently less than 75kB), especially compared to the size of the Delphi libraries you’ll be using for the runtime. You probably won’t notice it in the EXE size once you expose more than a few trivial libraries.
  • Last but not least, when loading a binary representation of a script, you have to make sure all libraries are compiled into the application that loads and wants to execute the script, and that they are entirely backward-compatible with what was exposed to the script back when it was compiled. That is irrelevant when re-compiling.

How fast is the DWS compiler?

I did some quick benchmarking against PascalScript and Delphi itself.
I generated a script based on the following template:

    var myvar : Integer;
    begin
       myVar:=2*myvar-StrToInt(IntToStr(myvar));
    end;

The assignment line being there only once, 100 times, 1000 times, etc. The result was saved to a file, and the benchmark consisted in loading the file, compiling and then running it for DWS. For PascalScript, the times are broken down into compiling, loading the bytecode output from a file, and then running that bytecode. Disk size indicates the size of the generated bytecode.
All times are in milliseconds (and have been updated, see Post-Scriptum below):

For line counts expected for typical scripts (less than 1000), compared to PascalScript, the cost of not being able to save to a bytecode is a one-time hit in the sub-15 milliseconds range, on the first run.
It illustrates why it isn’t really worth the trouble maintaining a bytecode version for scripting purposes, and that’s also my practical experience.

For larger scripts, it is expected the execution complexity will dwarf the compile time: the benchmark code tested here doesn’t have any loops, anything more real-life will have loops, and will likely have a greater runtime/compiletime ratio.

What of Delphi?

For reference, I tried compiling the larger line counts versions with Delphi XE, from the IDE.

  • the 100k lines case took 3 minutes 27 seconds to compile (ouch!), obviously hitting some Delphi parser or compiler limitation. Runtime was 63 ms.
  • the 10k lines case in Delphi compiled in a more reasonable 2400 msec, and ran in 4 ms (50% faster than DWS).

What else? The DWS compiler has an initial setup cost higher than PascalScript, but as code size grows, it starts pulling ahead. That setup overhead will nevertheless bear some investigation ;-) .
Once compiled, the 10x execution speed ratio advantage of DWS vs PascalScript is consistent with other informal benchmarks.

Post-Scriptum

Gave a quick look at the setup overhead with SamplingProfiler, and found two bottlenecks/bugs. The outcome was the shaving off of 3 ms from the DWS compile times, ie. the compile times for the 1, 100 and 1000 lines cases are now 0.95 ms, 2.85 ms and 19.1 ms respectively.

Tips , , ,

All hail the “const” parameters!

July 28th, 2010

Passing parameters as “const” is a classic Delphi optimization trick, but the mechanisms behind that “trick” go beyond cargo-cult recipes, and may actually stumble into the “good practice” territory.

Why does it work?

The most well known case is “const String“, for which the compiler than takes advantage of the “const” to pass the String reference directly (as a pointer)… and without increasing the reference counter.

To illustrate what the reference counting implies, here are two screen-shots from the CPU disassembly view, taken in Delphi 6, but the recent compilers up to Delphi 2010 at least behave in a similar fashion (just with the Unicode version of the functions).
I’ve made a pseudo-function, with only one String parameter that calls a single function (here Length(), which isn’t inlined in Delphi 6, and corresponds to LStrLen), guess which CPU disassembly corresponds to “Test(const a : String)” and which to “Test(a : String)

What you’re seeing on the left is an implicit try…finally construct which is used to protect the reference counting (LStrAddRef/LStrClr) on the implicit local variable used to store the String parameter. If you have such calls nested a few levels deep (not uncommon in object-oriented code), you could accumulate quite some overhead.
Also not see here but hidden within the AddRef and Clr calls are bus locks, which may hit you disproportionately in multi-threading scenarios.

And what if you’re modifying the passed String? Can you forego the “const“? Well no, the extra overhead is still there, and using a “const” still beneficial.

String isn’t the only type affected, there are similar gains for all reference counted types (interfaces, dynamic arrays, records holding reference-counted types). And when passing a record as “const“, there is an additional gain in the lack of defensive copying of the record (the larger the record, the greater the gain).

For ordinal and numeric types, there is no compiler optimization (yet), but it can be good practice to mark them as “const“, not in the optimistic hope that someday the compiler will optimize it, but to ease up debugging and code writing. Copies of those parameters (when you want to modify them in the function body) are typically handled adequately by the compiler (and often enough comes for free), so don’t let performance considerations hold you.

There is one case in which using “const” could have adverse effects, but it involves global variables which you would modify or release during the call… so you don’t really need to know more about it, do you? ;)

How does it manifests itself in Profiling?

A missing “const” can manifests itself in different ways during profiling, for String, Interface and dynamic array types, you’ll usually see it via the relevant xxxAddRef or xxxClr functions (the obvious case) and via time spent in begin/end (the less obvious case). Depending on how many functions calls with the reference counting are involved in your inner loop, none of the above may come ahead, even if calls and reference counting are an issue: the time spent will be spread over the above functions and your various begin/end sections.
For record parameters, the lack of “const” could materialize itself in “Move“, or other data copying costs (large records will be moved, smaller ones can be copied via ad-hoc in-place code by the compiler).

In multi-threading, things can be a little less obvious, as rather than the reference counting and exception frame, it could be the bus locks hitting you, or sometimes worse, inter-CPU traffic to pass around the values of the reference counter.

Multi-threading’s worst case would be referring to the same string field from multiple threads, and passing that string around in functions without “const” parameters, thus repeatedly hitting the reference counter of that string from multiple threads at the same time, resulting in extra inter-CPUs traffic to share the value of that reference counter.

In such cases, “const” can help, but it will often not be enough, you’ll also have to look for functions/methods that return strings with an unnecessary reference counting, like when getting your strings from a TStringList.

Good Practice

Usually I tend to “const” just about every parameter that isn’t “var” or an object, this isn’t just  to pre-emptively alleviate possible performance issues, but also to gain something valuable when debugging and writing a function’s body: untouched input parameters, wherever you are in the function’s body. This means not just in the function’s body, but also when in the functions called by the function.

And that’s the key point to remember about “const” parameter: it’s here to state that a parameter will be left unchanged inside the body, something which allows the compiler to optimize it’s output, and the developer to optimize his thought process too. Qualifying all your parameters as either “const” or “var” makes your intentions obvious.

So IME, it’s not worth it to hoard your keystrokes, best spend them on “const” and explicit local variables when you really need to alter those input parameters.
Whoever has to maintain your code will be thankful.

Object parameters

There is one case I’ve not mentioned in all the above, the case of object parameters (and class references). It’s a special case because when you pass an object as “const“, you’re not making any promises of not keeping the object constant, but just keeping the underlying variable (pointer) constant. This is unlike other languages. There is also a risk of confusion when your code will be read by developers not really familiar with the underlying pointer-based mechanics of objects (and they are more common than you may think, even, and maybe more so in highly OO languages).

So for objects, unless you want to spend a lot of time educating, my rule of thumb is to avoid “const” or “var” for object parameters.

Beyond “const” parameters

Const’ing your parameters will eliminate some defensive copying, implicit try…finally, and reference counting overhead, but it leaves open another field of similar inefficiencies: that of return values.

Alas there is no “const Result” syntax, which would allow to state that your result can’t be altered in any way (a sort of strict immutability if you will). Thus whenever you have a function returning a String f.i., you’ll end up triggering the implicit reference counting and exception machinery if you’re not careful (like in this case)…

There are strategies to tackle this issue, but none of them are as simple as adding a “const“, and most of them come with sacrifices… so, let them be food for other articles.

Tips , , , , ,

Knowing what and when to optimize…

April 20th, 2009
Comments Off

…is as important as knowing how to optimize.

In this thread on the Delphi forums Ante Bonic brought back to intention this excellent Delphi Optimization Guide in Delphi article by Robert Lee. The article has aged a bit, but many tips remain true with the Delphi 2009 compiler (sadly so).  Like many optimization articles, Robert’s focuses on mostly local optimization tips, which can draw in warnings like this one one by Anders Isaksson:

Optimization should be done after profiling, not before.

Which I couldn’t agree more with. But to be fair, Robert’s states so in his article, as do most authors of optimization articles. Recipes and local optimization tips are to be used after all algorithmic and data structures improvements have been taken advantage off.

If one can list tips and tricks for local optimization, do’s and don’ts that are true often enough to be good tips in many scenarios. However, it’s practically impossible to come up with a “reusable” list of tips for algorithms and data structures. Too many specifics can come together, even when the problems are similar, considerations of scale or reactivity can drastically influence architectural and algorithmic options.

Hence the most visible optimization recipes are often local optimization ones, but mostly because there are few global optimization recipes. You only have global optimization methodologies. But even these methodologies can usually be summarized with few words:

  1. Time, profile, analyze and confirm your bottlenecks.
  2. Improve algorithms & data structures.
  3. Exhaust 1 & 2 before looking at local optimizations, and then don’t forget 1.

To optimize efficiently, ie. not waste your time, you have to master the first point.
To optimize effectively, ie. not waste the machine time, you have to master the second.

And the third point you ask? It’s a razor’s edge, when applied effectively, it can be very efficient, with very few changes like in this case, but if not, it’s a good way to end up there. To be effective, local optimization has to be about taking care of hidden machinery, hidden shortcomings of the compiler, hidden algorithms and data-structures that get in the way.

I’ll close this post by quoting Robert Lee’s article on timing:

Timing code is generally called “profiling”. If you want to improve the performance of your code, you first need to know precisely what that performance is. Additionally, you need to re-measure with each change you apply to your code. Do not spend a single second twiddling code to improve performance until you have analytically determined exactly where the application is spending its time. I cannot emphasize this enough.

Tips , , , , , , ,