High Performance Delphi

Home Fundamentals Guide Code Links Tools Feedback

Integer Optimization Guidelines

In 32 bit code, such as generated by Delphi 2 and later, things just get better when the values being manipulated have a size of 32 bits. 16 bit variables (Word, ShortInt, WideChar) are especially slow as they require the processor to temporarily slip into 16 bit mode to work with them. This can double the time it takes to work with these values. 8 bit variables (Byte, SmallInt, Char) are not as bad, especially, if you do not mix their usage with 32 bit values. However, they can still cause the inclusion of additional instructions in order to zero out the rest of the 32 bit register.

If you must use a smaller type for compatibility, convert it to 32 bit as soon as possible, and back to the smaller size (if necessary) just prior when it is needed. You do this simply by assigning it to a 32 variable.

Avoid ordinal subranges

One of the advantages of Pascal has traditionally been its strong typing, and so the ability to create special subrange types and enumerations has been part of this. Unfortunately, subranges and enumerations can cause trouble when attempting to optimize for performance. The problem lies in the fact that the underlying variable type choosen hold a subrange or enumeration variable is based on the size of the subrange. For example, enumerations with less than 256 elements or subranges with boundary values ranging between 0 and 255 will be stored as byte. This can lead to trouble in that the underlying variable size may not be handled as efficiently. For instance, consider the following subrange:

type
  TYear=1900-2000;

Variables of type TYear will saved as 16bit quantities. As already discussed, 16bit variables are particularly slow.

Optimization Techniques

Play around with adding temporary variables to split up complex expressions

Typically, cramming everything into a single expression is the best way to optimize, but not always. At some point, the expression will become so complex that the compiler will be forced to break it up on its own. But frequently you can do a better job than the compiler. Try it!

Integer multiplication

Prior to the Pentium II, integer multiplication was quite expensive. With the arrival of the Pentium II however, integer multiplication has dropped down to the same one-cycle execution time as most other instructions. Additionally, the compiler will avoid doing multiplication by a constant if the same can be accomplished by utilizing addition, shifting and the lea instruction (mentioned below). Thus, you need to take your target processor into account when choosing whether to use multiplication or use some other equivalent method.

Comparison of a variable against multiple ordinal constants

This topic sounds heavy but only boils down to statements like these:

if (x > = 0) and (x < = 10) then
  DoSomething;

if (((c > = 'a') and (c < = 'z')) or 
    ((c > = '0') and (c < = '9'))) then
  DoSomething;

In each case, there is only a single variable and it is compared to multiple constants. It may be slightly more efficient and arguably clearer when the above code is expressed as:

if x in [0..10] then
  DoSomething;

if c in ['0'..'9', 'a'..'z'] then
  DoSomething;

The improvement in efficiency depends upon the likelihood of, for instance x bein within the range versus being out of the range. If it is more likely to be within the range then the set notation is better. Efficiency of the set notation increases as the number of sub-ranges increases. However, there is an inherent limitation on sets that limit them to 256 elements. This restricts this usage to values between 0 and 255 for integer types. For the full range of integers you can use this notation:

case x of
  0..10: DoSomething;
end;

case c of 
  'a'..'z',
  '0'..'9' : DoSomething;
end;

which produces equivalent code, but is not as elegant.

Advanced note: This operation may use an additional CPU register.

`movzx` vs `xor/mov`

A common requirement is to load values smaller than 32 bits in to a register. Since they do not overwrite the entire register it is necessary to zero out the register first. Alternatively, you can use the built in instruction movzx (move with zero extend). On Pentiums and before this instruction was slower than using xor reg,reg/mov reg,{value}. However, The PII has streamlined this instruction so that now it is prefered over the xor/mov combination. Note that the compiler chooses between these two options based on a set of rules that is apparently fairly complicated, as I have yet to figure them out.

Utilizing the LEA assembler instruction

There is an assembler instruction called LEA (Load Effective Address) that can do a couple operations at once. The only way to consciously take advantage of this instruction in Delphi is with array notation. For it to be fully effective, the array variable itself must be used again after the desired "trick" location. For example the following snippet is from a routine that calculates the length of a Pchar (StrLen) string (i.e. the position of the first #0 character). A total of four characters are processed at a time. Notice the usage of q in the calculation of r2.

function StrLenPas(tStr: PChar): integer; 
var
  p: ^Cardinal; 
  q: PChar; 
  bytes, r1, r2: Cardinal; 
begin 
...
  q := PChar(p^); // load 4 characters into q 
  r2 := Cardinal(@q[-$01010101]); // subtract 1 from each char (utilizing LEA) 
  r1 := Cardinal(q) xor $80808080; // check top bit (q must be used again) 
  bytes := r1 and r2; // distinguish between chars>127 and zero.     
  inc(p); 
... 
end;

Performance of large integer types

If you need to work with integers larger than can fit in longint, you have a couple of options (int64, comp, double, and extended). Three of these types are actually floating point types. Consequently they are not completely interchangeable. Only the relatively new int64 is completely handled as an integer. comp is sort of a hybrid in that it is stored as an 8 byte integer (the same as int64) but all operations are performed as floating point. Borland has officially designated comp as obsolete, and instead favors int64. However, as can be seen in the following table, comp enjoys a substantial performance lead in some circumstances. Extended and double can also be used to operate on large intergers although care must be taken to ensure that the lack of periodic rounding doesn't accumulate to the point of changing the answer. Shown below are the measured CPU Pentium II cycles for each operation with random values in the range (0 < x < 2^63). "Ovhd" refers to the overhed associated with making a function call and assignment with this type. LongInt is included for comparison purposes only.

              ovhd     add      mult     div
     Longint    2      1        1        4.7
     Comp      40      4.3      4.4     34
     int64     19      2.6     26.2    804
     double    25      3.1      1.3     35.8
     extended  43      4.1      3.2     34.4

Note: that the out-of-order execution capabilities of the Pentium II make precise timing measurements nearly impossible for individual operations. Consequently, the cycle counts shown above should only be considered approximate.

Aside from the horrid division performance for int64 there is no obvious best choice. Of the three Floating point based types double is best but it actually has slightly fewer digits (15 vs 19). int64 is better for addition than comp but worse otherwise.

So what to do. Well the best answer is to blend a bit. Use int64 as the base type. Division is easily handled by using trunc(Int64A/Int64B) instead of Int64A div Int64B. Getting the best performance is somewhat more complicated. Since comp and int64 have the same format, converting between formats is free. Using this you can force what would have been an integer based multiplication to a floating point one. This is shown below:

     var
       A,B,C,D:int64;
       CA:comp absolute A;
       CC:comp absolute C;
     begin
	   // Result:=A*B*C*D;  // Original expression
	   Result:=round(CA*B*C*D);  // blended version
	 end;

The usage of CA above forces the calculation to be done in floating point. Note that only one floating point type is needed to force the entire term into floating point. However, if there are multiple terms they each need a floating point variable: Result:=round(CA*B+CC*D). Also note that round is used instead of trunc as was used in division. while it would be possible to convert back into integer within an expression, it will typically not result in a speed increase unless two or three additions can be done in for each round.

Home Fundamentals Guide Code Links Tools Feedback

Integer Optimization Guidelines

Contents

Style Guide

32bit variables

Subrange types

Optimization Guide

temporary variables

Integer Multiplication

Conditionals

zero extending

asm LEA instruction

large integer types

Style Guidelines

Use 32 bit variables whenever possible

Avoid ordinal subranges

Optimization Techniques

Play around with adding temporary variables to split up complex expressions

Integer multiplication

Comparison of a variable against multiple ordinal constants

`movzx` vs `xor/mov`

Utilizing the LEA assembler instruction

Performance of large integer types

Home Fundamentals Guide Code Links Tools Feedback

Home Fundamentals Guide Code Links Tools Feedback

Integer Optimization Guidelines

Contents

Style Guidelines

Use 32 bit variables whenever possible

Avoid ordinal subranges

Optimization Techniques

Play around with adding temporary variables to split up complex expressions

Integer multiplication

Comparison of a variable against multiple ordinal constants

movzx vs xor/mov

Utilizing the LEA assembler instruction

Performance of large integer types

Home Fundamentals Guide Code Links Tools Feedback

`movzx` vs `xor/mov`