Style Guidelines
            
          Coding Style versus efficiency
           
          Optimization involves not only the speed of your code, but also the 
            speed with which you create and debug your code. This means that you 
            are not doing yourself any favors by creating fast but incomprehensible 
            code. Fortunately, creating optimal code in Delphi rarely requires 
            ugly code. In fact, optimal code is often elegant code as well. Additionally, 
            within a given application it is likely that the same sort of techniques 
            will be used frequently. Thus, you can essentially set the coding 
            style to what gives the best performance when it matters.  
           
          Keep it simple
           
          When it comes to the Delphi optimizer, complexity kills. Keep routines 
            simple (no more than about 5 to 8 variables "in play"). 
            Do not do too much in any single loop. Overloading a loop causes variable 
            addresses, array indices etc. to be reloaded on each iteration. Loop 
            overhead is actually quite low so it is often advantageous to split 
            a complex loop into multiple loops or to move the innermost one or 
            two loops to a separate routine. Done properly, this will have the 
            added benefit of improving the readability of your code.  
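          For example, a loop that both accumulates a sum and tracks a maximum 
            (Data, Count, Sum and Max are hypothetical names here) can be split 
            into two simple loops, each with fewer variables "in play": 
           
```pascal
{ One overloaded loop... }
Sum := 0;
Max := Data[0];
for i := 0 to Count - 1 do
begin
  Sum := Sum + Data[i];
  if Data[i] > Max then
    Max := Data[i];
end;

{ ...split into two simple loops; loop overhead is low, and each
  loop now keeps fewer addresses and indices live per iteration }
Sum := 0;
for i := 0 to Count - 1 do
  Sum := Sum + Data[i];

Max := Data[0];
for i := 1 to Count - 1 do
  if Data[i] > Max then
    Max := Data[i];
```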
           
          Strongly favor local variables
           
          Local variables are those declared within a routine in addition to 
            any parameters passed. Only local variables can be converted into 
            register variables, and register variables equal speed. Consequently, 
            it is often advantageous to copy data into a local variable prior 
            to using it. Typically this is most advantageous when the variable 
            is to be used within a loop. Thus, the overhead of copying is offset 
            by speedy reuse of the copied data. This optimization is particularly 
            useful if class members are used in a tight loop. Delphi tends to 
            load the pointer / class member just prior to its use within 
            the loop, adding a lot of unnecessary overhead.  
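          As a sketch (TAccumulator, FData and FSum are hypothetical names), 
            copying class members into locals before a tight loop looks like this: 
           
```pascal
type
  TAccumulator = class
  private
    FData: array of Double;
    FSum: Double;
  public
    procedure Accumulate;
  end;

procedure TAccumulator.Accumulate;
var
  i: Integer;
  LocalSum: Double;  // a local can become a register variable; FSum cannot
begin
  LocalSum := 0;
  for i := 0 to High(FData) do
    LocalSum := LocalSum + FData[i];
  FSum := LocalSum;  // write the class member back once, after the loop
end;
```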
          There is one exception to this rule: arrays with elements of a simple 
            type. If you have an array of constant size and constant data, making 
            it global will save a register during calculations. Since saving a 
            single register is not worth a lot, this should only be used for constant 
            structures (conversion or transformation tables) where having a global 
            structure makes some sense to begin with.  
           
          Keep the number of parameters low
           
          Small but heavily used routines should not have more than three parameters, 
            as that is the maximum that can be passed by register. By following 
            this rule you maximize the use of registers and give the Delphi optimizer 
            a better chance to improve your code. Note that class methods have 
            a hidden parameter, Self, that is always passed implicitly, 
            so for methods only two register parameters are left.  
           
          Do not use nested routines
           
          Nested routines (routines within other routines; also known as "local 
            procedures") require some special stack manipulation so that 
            the variables of the outer routine can be seen by the inner routine. 
            This results in a good bit of overhead. Instead of nesting, move the 
            procedure to the unit scope and pass the necessary variables 
            as parameters - by reference (the var keyword) if 
            needed - or make the variables global at the unit scope.  
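          A minimal sketch of the transformation (the names are made up): 
           
```pascal
{ Before: Inner needs special stack manipulation to see Total }
procedure Outer;
var
  Total: Integer;

  procedure Inner;
  begin
    Inc(Total);
  end;

begin
  Total := 0;
  Inner;
end;

{ After: Inner moved to unit scope, with Total passed by reference }
procedure Inner(var Total: Integer);
begin
  Inc(Total);
end;

procedure Outer;
var
  Total: Integer;
begin
  Total := 0;
  Inner(Total);
end;
```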
           
          Pointer variables
           
          A valuable technique is to take advantage of pointers. A lot of programmers 
            shy away from pointers due to the potential for access violations, 
            memory leaks and other low-level problems. However, pointers are a 
            valuable tool in optimizing code in Delphi. Fortunately, this does 
            not mean you have to convert all your data references to pointers. 
            What it does mean is that you should take advantage of pointers as 
            temporary references into your data. These temporary variables will 
            typically be optimized into register variables. Consequently, you 
            really are not adding any machine code, you are only providing 
            a clue for the compiler that it needs to hold on to the intermediate 
            address. You can use pointers much the same way as you would use a 
            with statement. That is, use them to simplify complicated 
            or redundant addressing in complex data structures. In the case of 
            with statements, this is exactly what the compiler does 
            internally. For example:  
          
with Structure1.Structure2[i] do
begin
  ...
end;
 
          at the compiler-level becomes:  
          
InnerStructure := Structure1.Structure2[i];  // if it is a class
InnerStructure := @Structure1.Structure2[i];  // some other type
begin
  ...  // references to InnerStructure
end;
 
           
          Linked lists vs. Arrays
           
          Finding the trade-off between linked lists and arrays is a classic 
            design problem. On older computers (Pentium and before), integer multiplication 
            was a slow operation. Since multiplication is the key to accessing 
            arrays, this shifted the performance balance towards linked lists in 
            some cases. The first question is whether you need random or sequential 
            access. Obviously, if you truly need random access then an array is 
            the way to go for anything more than about 5 elements (a rule of thumb 
            based on experimentation). For sequential or quasi-sequential access 
            the short answer is that arrays are better for simple element types 
            and linked lists are better for larger types.  
          Multiplication on the Pentium II is now much faster. Consequently, 
            on that CPU array access is always the faster choice.  
           
          Types of Arrays
           
          In Delphi, arrays come in many flavors: Static, Dynamic, Pointer 
            and Open. Static arrays are the classic Pascal array type (A: array[0..100] 
            of Byte). Dynamic arrays are the new array type introduced with Delphi 
            4 (A: array of Byte). Pointer arrays are simply pointers to Static 
            arrays, however, the actual number of elements may not match that 
            of the array boundaries. Finally Open arrays look like dynamic arrays 
            but are exclusively used as parameters of routines. The underlying 
            implementation of all these arrays varies quite substantially. From 
            an efficiency viewpoint Static and Pointer arrays are the best choice, 
            followed by Open, then Dynamic arrays. However, Static arrays are 
            often too inflexible, and Pointer arrays can create painful management 
            issues. Fortunately, the various types are convertible. For arrays 
            without a fixed size, the current best choice is to manage 
            them as Dynamic arrays, and convert them to Pointer arrays as needed. 
           
          Dynamic arrays are much like huge strings (AnsiStrings) in that the 
            variable is actually a pointer to the array's first element. Thus, 
            converting a Dynamic array to a Pointer array is simply an assignment. 
            While the length (size) of the Dynamic array is stored just before 
            the first element, using High or Length 
            on a Dynamic array generates a function call rather than some compiler 
            "magic" to extract the length - in fact, High 
            calls Length. Consequently, do not repeatedly get the 
            size of the array via these functions. Get it once (in the routine) 
            and save it. 
           Both Dynamic and Open arrays incur a fair amount of overhead when 
            used as parameters unless you use the const or var 
            modifiers. Note that, just like class parameters, a const 
            modifier on a Dynamic array only prevents you from changing the entire 
            array, not from modifying its content.  
          Let's finish with an example for converting the various array types: 
           
          
type
  TDoubleArray = array of Double;
  TStaticDoubleArray = array[0..0] of Double;
  PDoubleArray = ^TStaticDoubleArray;

{$R-} // range checking must be off: indexing beyond the declared 0..0 bound is intentional
function Sum(const X: TDoubleArray): Double;
var
  P: PDoubleArray;
  i: Integer;
begin
  P := Pointer(X);
  Result := 0;
  for i := 0 to Length(X) - 1 do
    Result := Result + P^[i];
end;
           
          Exceptions
           
          Do not use exceptions just to jump out of a bit of code, or as a 
            catch-all on input errors. They add overhead with both the try..finally 
            block and with throwing the exception itself. Use the break, continue, 
            or exit statements to do unusual flow control, and validate inputs 
            (like pointers) as early as possible, but outside any loop.  
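          A sketch of exception-free flow control (the names are hypothetical): 
           
```pascal
{ Break and a "not found" result replace exception-based bail-out }
function IndexOfValue(const A: array of Integer; Value: Integer): Integer;
var
  i: Integer;
begin
  Result := -1;                // the "not found" answer; no exception needed
  for i := 0 to High(A) do
    if A[i] = Value then
    begin
      Result := i;
      Break;                   // cheap flow control, no try..finally overhead
    end;
end;
```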
           
          Use type-casting rather than absolute
           
          A technique sometimes used to avoid typecasting is to "overlay" 
            a variable with another of a different type by using the absolute 
            keyword. However, this prevents the variable from becoming a "fast" 
            register variable. It is better to type-cast and save the original 
            variable into a new variable. For example:  
          
procedure DoSomething(s: PChar);
var
  ByteArray: PByteArray absolute s;
begin
  ...
 
          should be changed to or written as:  
          
procedure DoSomething(s: PChar);
var
  ByteArray: PByteArray;
begin
  ByteArray := PByteArray(s);
  ...
 
           
          Working with Sets
           
          There are two compiler-magic procedures, Include and Exclude, 
            that are quite efficient for adding and removing single elements 
            from sets. Thus, you should use these instead of the "s := s + [a];" 
            sort of statement. In fact, they are efficient enough that a small number 
            of repeated uses of Include or Exclude can still be better than the 
            above notation.  
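          For example (TOption and its values are made-up names): 
           
```pascal
type
  TOption = (opBold, opItalic, opUnderline);
  TOptions = set of TOption;
var
  Options: TOptions;
begin
  Options := [];
  Include(Options, opBold);    // preferred over Options := Options + [opBold]
  Exclude(Options, opBold);    // preferred over Options := Options - [opBold]
end;
```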
           
          Pentium II specific bottlenecks
           
          It has occurred to me that while many of the techniques presented 
            here are based upon how Pentium II processors bottleneck, I have never 
            actually stated how this works. The long and detailed version can 
            be found in Intel's documentation and in Agner Fog's Pentium Optimization 
            Guide. Here I present a quickie version slanted towards Delphi's compiler 
            output. Having a general understanding of this process may help 
            you decide what needs optimizing and what does not.  
          First off, the Pentium II is a superscalar pipelined processor with 
            out-of-order execution capabilities. Basically that means that each 
            instruction gets "executed" in steps and can march along 
            one of a few different channels. Specifically, each instruction has 
            to be loaded, executed and retired with the out-of-order buffer acting 
            as a sort of "waiting room" between the load and execution 
            steps. This seems simple enough, but the complications start to build 
            once the multiple channels part is added in, because not all channels 
            can handle all instructions. There are 3 loading channels, one of 
            which can take anything while the other two can only handle "simple" 
            instructions. There are 5 execution channels (called ports by Intel), 
            one is general purpose integer, another is general purpose integer 
            plus floating point, the third handles address math, and the fourth 
            and fifth load and store data. The retiring step also has 3 channels. 
            There is also the issue of latency. That is, many instructions take 
            longer than 1 cycle to execute.  
          So what does all this mean? Well, it means you can bottleneck in 
            a whole bunch of different ways. Basically, any channel of any step 
            can be a bottleneck if too many instructions that require that specific 
            unit are encountered. Thus, while the CPU can theoretically process 
            3 instructions per cycle, it may be limited to 2 or 1 or even less 
            due to the mix of instructions currently being executed. The out-of-order 
            "waiting room" helps with this situation by allowing instructions 
            not dependent on a currently executing operation to go around any 
            that may be waiting for a specific port and execute on a different 
            port. This helps with the small bottlenecks where there is a temporary 
            backup on a given port. However, it does nothing for the large scale 
            backups. For instance, executing a large series of floating point 
            operations, say a loop around a complex math expression, will typically 
            be constrained by the limitation of there being only one FP capable 
            port. Thus, the throughput drops to 1 instruction/cycle. Now the loop 
            also has some other overhead instructions associated with it (incrementing 
            the loop variable and jumping). These instructions incur essentially 
            zero cycles since they can be fitted in around the backed-up FP port. 
           
          In Pascal terms, this means that any tight, repetitive operation 
            will probably be constrained by only one aspect of the operation. 
            It might be Floating point as described above, or it might be integer 
            math or memory addressing. In any case, the only optimizations that 
            will have any impact are those that go after that main aspect. Pruning 
            the ancillary code back will have no effect.  
           
          Inside the For statement
           
          Implementing For statements is one of the more complex 
            jobs the compiler has to deal with. This is in part, because the compiler 
            goes to great pain to avoid integer multiplication, which was quite 
            slow on CPUs before the Pentium II. Thus, For loops are 
            deconstructed into pseudocode that looks something like this:  
          Original Loop:  
          
  for i := m to n do
    A[i] := A[i] + B[i];
 
          Becomes:  
          
  PA := @A[m];
  PB := @B[m];
  Counter := n - m + 1;
  if Counter > 0 then
    repeat
      PA^ := PA^ + PB^;
      Inc(PA);
      Inc(PB);
      Dec(Counter);
    until Counter = 0;
          There are other configurations, but this is the most common, and 
            it is the one that causes problems. The problem stems from the fact 
            that the variable i appears nowhere in the deconstructed 
            version. However, when stepping through this code in the debugger, 
            watching i will show the value of the new variable most 
            similar to i, which is counter. This has 
            sent many a programmer into fits of hysteria thinking that their loop 
            is being executed backwards. It is not. The debugger is merely misinforming 
            you.  
          This example also illustrates the substantial overhead associated 
            with for loops. Notice that three variables need to be incremented 
            on each iteration, and that there is a fair amount of initialization 
            code. In some cases, this overhead is lessened. For instance, if m 
            and n were compile-time constants, or if m = 0 and A and 
            B were already pointers that were not used again in the code, then 
            the overhead code would be reduced.  
           
          Interfaces
           
          This is just a beginning pass at the performance implications of 
            using interfaces. Basically, an interface is implemented as a cross 
            between a string and a class. Interfaces, like strings, are reference 
            counted, which means every time you create or copy one, and every time 
            an interface variable goes out of scope, there is some overhead. Thus, 
            treat interface variables more like you would string variables than 
            object variables. (Pass by const where possible, watch 
            out for using too many temp variables, etc.) Internally, interfaces 
            behave something like an object with all virtual methods, only worse. 
            There are in fact two layers of indirection. So treat them accordingly. 
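          For instance, a sketch of const passing (IWidget is a hypothetical 
            interface): 
           
```pascal
type
  IWidget = interface
    procedure Draw;
  end;

{ const avoids the implicit AddRef on entry and Release on exit }
procedure PaintWidget(const Widget: IWidget);
begin
  Widget.Draw;
end;
```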
           
          Another note on the subject of interfaces: 
          Hallvard Vassbotn: For any interfaced global variable, the compiler 
            automatically adds another global variable that contains the address 
            of this first variable. When external units access the variable, it 
            is used through the implicit pointer variable. The reason for this 
            is to support packages and is done even if you are not using packages. 
            (See article in The Delphi Magazine issue 43 for more details.)  
           
          Optimization Techniques
            
          Keep an open mind
           
          Optimization is best approached as a top down issue. The most powerful 
            concept in optimization can be stated as "If it takes too long 
            to figure out the answer, then change the question." The best 
            improvements in performance will always come from changes made at 
            the design and algorithmic level. By the time you get down to coding 
            specifics, your options are quite limited. Unfortunately, like the 
            rest of design, it is rather difficult to break down this high-level 
            optimization into a nice set of rules. Nonetheless, the first thing 
            to do if performance needs improvement, is to look at the complete 
            problem at hand, starting optimization at the top and then working 
            down.  
           
          Time your code
           
          Timing code is generally called "profiling". If you want 
            to improve the performance of your code, you first need to know precisely 
            what that performance is. Additionally, you need to re-measure 
            with each change you apply to your code. Do not spend a single second 
            twiddling code to improve performance until you have analytically 
            determined exactly where the application is spending its time. I cannot 
            emphasize this enough.  
           
          Code alignment
           
          Be aware that the exact positioning of your code and its layout in 
            the executable module can affect its timing. The reason for this is 
            that there are penalties for jumping to "sub-optimal" address 
            alignments. There is very little you can do to influence this alignment 
            (this is a linker task), except to be aware of it. While it is possible 
            to insert spacing code into a unit, there is no guarantee that your 
            alignment efforts will be permanently rewarded since 32 bit Delphi 
            aligns only on 4 byte (DWord) boundaries. Thus, the next change to 
            any routine in any unit above the current routine may shift your code. 
            This problem can result in speed penalties as great as 30% in tight 
            loops. However, in more typical loops the problem is substantially 
            less. Also note that this can make timing code and therefore optimization 
            difficult since seemingly innocuous shifting in the code can affect 
            the performance. Consequently, if you make a change that you "know" 
            should increase performance but it does not, it may be that shifts 
            in code alignment are hiding improvements in your code.  
           
          Utilize the CPU window
           
          You do not need to be an assembler programmer to take advantage of 
            the CPU window. At the very least it will give you an idea of the 
            underlying complexity involved in each statement. Very often you can 
            estimate the effectiveness of a particular optimization technique 
            by simply counting the number of instructions produced for a given 
            operation. For instance, many references to the ebp register (as in 
            mov eax,[ebp-$04]) within a loop are an indication that 
            variables are continually reloaded. These reloadings are unnecessary 
            and thus are a prime target for optimization.  
          In Delphi 4.0, the CPU window is readily accessed from the main menu. 
            However, both version 2.0 and 3.0 of Delphi also have hidden CPU windows. 
            To allow access to these you need to add an entry in the registry, 
            using the registry editor "RegEdit.exe":  
          
[HKEY_CURRENT_USER\Software\Borland\Delphi\2.0\Debugging]
"EnableCPU"="1"
[HKEY_CURRENT_USER\Software\Borland\Delphi\3.0\Debugging]
"EnableCPU"="1"
 
           
          Unroll small loops
           
          Loop unrolling is a classic optimization technique, and it can be 
            easily done in Delphi. However, it is only worth doing on fairly small 
            loops. Unrolling essentially consists of doing what was originally 
            multiple iteration operations within a single pass through the unrolled 
            code. This reduces the relative loop overhead. Since the branch prediction 
            mechanism on Pentium II CPUs does not perform well (read: causes penalties) 
            on very tight loops, unrolling might be beneficial there, too.  
          In Delphi, the best way to unroll is usually with a while 
            loop. For example:  
          
i := 0;
while i < Count do
begin
  Data[i] := Data[i] + 1;
  Inc(i);
end;
 
          becomes:  
          
i := 0;
while i < Count do
begin
  Data[i] := Data[i] + 1;
  Data[i+1] := Data[i+1] + 1;
  Inc(i, 2);
end;
 
          The downside to loop unrolling is that you have to worry about what 
            happens when Count is not divisible by the factor two. 
            You typically handle it this way:  
          
i := 0;
if Odd(Count) then
begin
  Data[i] := Data[i] + 1;
  Inc(i);
end;
while i < Count do
begin
  Data[i] := Data[i] + 1;
  Data[i+1] := Data[i+1] + 1;
  Inc(i, 2);
end;
 
          You can unroll by whatever factor you want. However, the diminishing 
            marginal return on unrolling by progressively larger values coupled 
            with the growing complexity of the code makes unrolling by more than 
            a factor of 4 rather uncommon.  
           
          Eliminate conditionals within loops
           
          It is common for there to be if statements within a 
            loop with the conditionals for the statement based on the loop index. 
            Frequently, these can be removed from the loop by unrolling the loop 
            or splitting the loop into two loops. An example of the former would 
            be when statements must be executed every other iteration. An example 
            of the latter would be when statements are executed on a specific 
            iteration.  
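          As a sketch of the latter case (A, B and Count are hypothetical 
            names), a test that is only true on a specific iteration can be 
            peeled out of the loop: 
           
```pascal
{ Before: the conditional is evaluated on every iteration }
for i := 0 to Count - 1 do
  if i = 0 then
    A[i] := 0                  // only ever true on the first pass
  else
    A[i] := A[i - 1] + B[i];

{ After: the special iteration is handled once, outside the loop }
A[0] := 0;
for i := 1 to Count - 1 do
  A[i] := A[i - 1] + B[i];
```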
           
          Reduce the number of looping conditionals
           
          A common coding structure is to loop while some condition is true 
            and while the loop index is less than some value. If the loop is small 
            - and it often only consists of incrementing the loop index - then 
            the bulk of the loop's execution time is spent evaluating the loop 
            conditionals. It is sometimes possible to reduce the number of these 
            conditionals by making one condition happen when the other would have. 
            Take the example of scanning for the occurrence of a specific character 
            in a string:  
          
i := 1;
l := Length(s);
while ((i <= l) and (s[i] <> c)) do
  Inc(i);
...
 
          The two conditionals can be combined by placing the desired character 
            in the last position in the string:  
          
i := 1;
l := Length(s);
lastc := s[l];
s[l] := c;
while s[i] <> c do
  Inc(i);
s[l] := lastc;
...
 
          This results in nearly a 2x speed improvement. This optimization 
            requires some forethought to ensure that there is an empty space available 
            at the end of the data. Strings and PChars always have the null at 
            the end that can be used. Also, this technique may cause undesired 
            side-effects in a multi-threaded environment due to changing the data 
            being scanned. This technique also works well with unrolling, as it 
            often simplifies the problems associated with needing partial iterations. 
            See FindMax as an additional 
            example of this technique.  
           
          Make the default path, the "no-jump" path
           
          This technique dates way back. However, there is still a glimmer 
            of truth in it. The original reason (jumping took a lot of time) really 
            isn't the problem now. The problem now is more related to code alignment 
            and branch prediction. On the alignment side, if you do not jump then 
            there is no problem with alignment. On the branch prediction 
            side, a branch is not even considered for prediction until it is actually 
            taken once.  
           
          Take advantage of break, exit and continue
           
          These flow control statements are often derided as being "bad 
            programming". However, they do have their place, especially in 
            performance optimizing code. The need for these statements typically 
            arises when some condition is not determined until the middle 
            of a loop. Often they are avoided by adding boolean variables and 
            additional conditionals to transfer control. However, these additions 
            cost execution time, and can often make the code look more, rather 
            than less, complex.  
           
          Resorting to assembler
           
          Do not attempt to use assembler to improve performance on Pentium 
            II CPUs. This is somewhat controversial, but it is a pretty good rule 
            of thumb. The out-of-order execution capabilities of Pentium II class 
            CPUs pretty much eliminate any advantage you might gain by recoding 
            your algorithm in assembler. In several tests, I have found that assembler 
            vs. optimally coded Pascal rarely exceeds a 10% difference. There 
            are always exceptions to this rule (for instance this Alpha 
            Blending code) and even times where arguably assembler 
            is cleaner than Pascal, but the point is that you should not just 
            jump to assembler when code seems too slow.  
          On the other hand, Pentium CPUs can often benefit from assembler 
            coding. Improvement factors of around two are not uncommon. However, 
            optimal coding for the Pentium processor can easily result in rather 
            non-optimal code on other processors. This applies to floating point 
            code in particular. If you choose to pursue this path then study Agner 
            Fog's assembler optimization manual carefully.  
           
          For versus While loops
           
          Loops with a pre-determined number of iterations can be implemented 
            either with a For loop or a While loop. 
            Typically, a For loop would be chosen. However, the underlying 
            implementation of the For loop is not as efficient as 
            that of the While loop in some instances (See Inside 
            the For statement). If the contents of the loop involve 
            no arrays or only single-dimensional arrays with elements of sizes 
            1, 2, 4, or 8 bytes, the code generated for a While loop 
            will be more efficient and cleaner than for the comparable For 
            loop. On the other hand, multi-dimensional arrays or arrays with elements 
            of sizes other than those listed above are better handled by For 
            loops. It is often possible to convert one of the latter into the 
            former, typically by using pointers. This approach is likely to increase 
            the efficiency of the code.  
          Additionally, using a While loop appears to reduce the 
            complexity factor. Thus it may be possible to trade For 
            loop usage against splitting up a routine. An example of this is the 
            row reduction step of Gauss Elimination. 
            The optimal configuration with For loops is to factor 
            out the two innermost loops into a separate routine. With While 
            loops however, all three loops can be kept together.  
          Of course, there has to be an exception. If the index is never used 
            within the loop then For is usually a better choice. 
            Also give For a shot if both loop bounds are compile-time 
            constants.  
          Note that for a While loop to be as efficient as possible, 
            the loop condition should be as simple as possible. This means that, 
            unlike For loops, you need to move any calculation of 
            the iteration count out of the While statement and into 
            a temporary variable.  
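          A minimal sketch of the conversion, with the iteration count hoisted 
            into a temporary variable first: 
           
```pascal
{ For loop version }
for i := 0 to Count - 1 do
  Data[i] := Data[i] * 2;

{ While loop version: the bound goes into a temporary before the loop }
i := 0;
n := Count;                    // keep the While condition as simple as possible
while i < n do
begin
  Data[i] := Data[i] * 2;
  Inc(i);
end;
```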
           
          Large Memory requirement problems
           
          On Pentium II class CPUs it is often the case that cache or memory 
            bottlenecks are the main optimization problem, especially if the data 
            set being manipulated is large. If this is the case, then an entirely 
            different strategy is in order. Focus on reducing the memory requirements 
            and on reducing the number of passes through the data. In other words, 
            pack it tight and do as much as possible with it before moving on. 
            This may be at odds with some of the other suggestions presented here, 
            so it is necessary to determine which factor is more rate limiting 
            by experimenting with different implementations and profiling these. 
            A good indication that a cache and/or memory bottleneck is dominating 
            is when apparent improvement in the code being executed does not increase 
            the performance.  
           
          Case statement optimization
           
          Case statements are implemented as follows: First, the 
            compiler sorts the list of enumerated values and ranges. This means 
            that the placement of individual cases within the Case statement 
            is irrelevant. Next, the compiler uses a sort of binary comparison 
            tree strategy along with jump tables to test the cases. The decision 
            between jump table and comparison tree is based on the "density" 
            of the enumerated cases. If the density is sufficiently high a jump 
            table will be generated. If the density is too low then the list will 
            be split approximately in half (with ranges counting as 1 element 
            in the list rather than as the number of values spanned). The process 
            then starts over on each of the sub-branches. That is, density check 
            then either generate jump table or split. This continues until all 
            the cases are handled.  
          So what optimization possibilities exist? Basically, it boils down 
            to this. Case statements are quite well optimized, but 
            not perfect. The "splits" on the binary comparison tree 
            can come at awkward places. Consequently, if you have groups of consecutive 
            values interspersed with gaps, it is better to make each range of 
            consecutive values its own Case statement and then make 
            an overall Case statement with the underlying ranges 
            each being a single case. This works because ranges won't be split 
            but sequential cases will be. The impact of this is that it tends 
            to create jump tables covering each entire sub-range. An example: 
           
          Before: 
          
  Case x of
    100 :DoSomething1;
    101 :DoSomething2;
    102 :DoSomething3;
    103 :DoSomething4;
    104 :DoSomething5;
    105 :DoSomething6;
    106 :DoSomething7;
    107 :DoSomething8;
    200 :DoSomething9;
    201 :DoSomething10;
    202 :DoSomething11;
    203 :DoSomething12;
    204 :DoSomething13;
    205 :DoSomething14;
    206 :DoSomething15;
    207 :DoSomething16;
    208 :DoSomething17;
    209 :DoSomething18;
    210 :DoSomething19;
  end;
          After: 
          
  Case x of
    100..107 : 
      case x of
        100 :DoSomething1;
        101 :DoSomething2;
        102 :DoSomething3;
        103 :DoSomething4;
        104 :DoSomething5;
        105 :DoSomething6;
        106 :DoSomething7;
        107 :DoSomething8;
      end;
    200..210 :
      case x of
        200 :DoSomething9;
        201 :DoSomething10;
        202 :DoSomething11;
        203 :DoSomething12;
        204 :DoSomething13;
        205 :DoSomething14;
        206 :DoSomething15;
        207 :DoSomething16;
        208 :DoSomething17;
        209 :DoSomething18;
        210 :DoSomething19;
      end;
  end;
          Also, Case statements do not have any provision for 
            weighting by frequency of execution. If you know that some cases are 
            more likely to be executed than others, you can use this information 
            to speed up execution. The way to achieve this is to cascade if 
            and Case statements to prioritize the search order. An 
            example:  
          Before: 
          
  Case x of
    100 :DoSomething1;
    101 :DoSomethingFrequently2;
    102 :DoSomething3;
    103 :DoSomething4;
    104 :DoSomething5;
    105 :DoSomething6;
    106 :DoSomething7;
    107 :DoSomething8;
  end;
          After: 
          
  if x=101 then
    DoSomethingFrequently2
  else
    Case x of
      100 :DoSomething1;
      102 :DoSomething3;
      103 :DoSomething4;
      104 :DoSomething5;
      105 :DoSomething6;
      106 :DoSomething7;
      107 :DoSomething8;
    end;
           
          Moving and zeroing memory
           
          The built-in methods supplied with Delphi for moving memory and filling 
            it with zeros are Move and FillChar respectively. 
            These routines are based around the rep movsd and rep 
            stosd assembler instructions, which are fairly efficient. However, 
            there is some extra cleanup code associated with each of the routines 
            that can reduce their efficiency, especially when working on smaller 
            amounts of memory. Additionally, there are special data alignment 
            considerations on Pentium II CPUs that can have a substantial effect. 
           
          The first pass solution to these issues is simply to use a plain 
            loop to do the task. This is especially effective if the data elements 
            being handled are 32 or 64 bit or the structure involved is only partially 
            zeroed/moved (e.g. sub-section of a matrix). However, the loop approach 
            is less effective for arrays of large records or for smaller elements 
            like byte or word. You can unroll the loop and use typecasting to 
            further improve the situation but this complicates the code substantially 
            and only results in a small improvement.  
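          As a sketch of the typecasting approach (the buffer Buf 
            and the overlay type are hypothetical, and the byte count is assumed 
            to be evenly divisible by 4), a byte array can be zeroed a dword at 
            a time:  
          
type
  TDwordView = array[0..8191] of LongWord;  // overlay type for typecasting
  PDwordView = ^TDwordView;
var
  i: integer;
  P: PDwordView;
begin
  P := PDwordView(@Buf);  // view the byte buffer as dwords
  for i := 0 to (SizeOf(Buf) div 4) - 1 do
    P[i] := 0;            // clears four bytes per iteration
end;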
          At this point it is time to start looking at a specialized routine. 
            The first issue is the boundary size of the structure. Working with 
            32bit quantities is always the fastest approach. However, if the structure 
            is not evenly divisible by 4 bytes this can be a problem. The solution 
            used by Move and FillChar is to round down 
            the size to the nearest dword (4 byte) boundary, then copy the remainder 
            separately. As was mentioned, this extra overhead can be costly on 
            smaller structures. However, many structures are in fact evenly divisible 
            even if they do not at first appear to be. All memory allocations 
            are rounded up to the next dword boundary. Thus a string of 2 characters 
            is really 4 bytes long. It is usually faster to copy or zero this 
            extra data than to avoid it. Obviously care must be taken when using 
            this shortcut and it should be well documented.  
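          As an illustration of this shortcut (a sketch, assuming 
            P points at the start of a heap allocation so the 
            padding bytes really exist), the byte count can be rounded up to the 
            next dword boundary before zeroing:  
          
  // (Size + 3) and not 3 rounds Size up to a multiple of 4; the extra
  // one to three bytes are allocator padding, so writing them is safe here
  FillChar(P^, (Size + 3) and not 3, 0);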
          Dealing with the data alignment issue is more complicated, and is 
            mainly relevant for larger structures. I will skip the description and 
            simply show the code:  
          
procedure ZeroMem(A:PDataArray; N:integer);
var
  c:integer;
  B:PDataArray;
begin
  // round A up to the next 16 byte boundary
  B:=Pointer((integer(A)+15) and $FFFFFFF0);
  // c = number of bytes from the aligned address to the end of the array
  c:=integer(@A[N])-integer(B);
  // clear the unaligned head, then the aligned remainder
  fillChar(A^,N*SizeOf(TData)-c,#0);
  fillChar(B^,c,#0);
end;
 
          This will align on a 16 byte boundary by skipping over some of the 
            data. Of course the skipped part must be properly dealt with, hence 
            the two calls to FillChar. Obviously, this is not the 
            fastest approach, since you now have the overhead of two FillChar 
            calls, but it does illustrate the technique. For maximum speed, this 
            is one of those cases where you have to resort to assembler.  
          
procedure ZeroMem32(P:Pointer;Size:integer);
// Size=number of dword elements to fill
// assumes that Size>4
asm
  push edi
  mov ecx,edx
  xor edx,edx
  mov dword ptr [eax],edx
  mov dword ptr [eax+4],edx
  mov dword ptr [eax+8],edx
  mov dword ptr [eax+12],edx
  mov edx,eax
  add edx,15
  and edx,-16
  mov edi,edx
  sub edx,eax
  shr edx,2
  sub ecx,edx
  xor eax,eax
  rep stosd
  pop edi
end;
 
          The move version is very similar. The Dest pointer is the one aligned: 
           
procedure MoveMem32(Src,Dest:Pointer;Size:integer);
// Size=number of dword elements to move
// assumes that Size>4
asm
  push edi
  push esi
  push ebx
  mov ebx,[eax]
  mov [edx],ebx
  mov ebx,[eax+4]
  mov [edx+4],ebx
  mov ebx,[eax+8]
  mov [edx+8],ebx
  mov ebx,[eax+12]
  mov [edx+12],ebx
  mov ebx,edx
  add ebx,15
  and ebx,-16
  mov edi,ebx
  sub ebx,edx
  shr ebx,2
  sub ecx,ebx
  lea esi,[eax+4*ebx]
  rep movsd
  pop ebx
  pop esi
  pop edi
end;
 
           
          Global Data revisited
           
          There is a case where using a global structure is especially advantageous, 
            namely two-dimensional (or rather double-indexed) arrays with elements 
            of a simple type and where access is non-sequential for both indices. 
            By making the data structure global, both indices can be applied simultaneously 
            to access the structure, thereby avoiding additional instructions 
            to combine the indices.  
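          A minimal sketch (the array name and bounds are invented): because 
            the global array's base address is a compile-time constant, both 
            indices can be folded into a single address calculation:  
          
var
  Grid: array[0..99, 0..99] of integer;  // global: base address is a constant

procedure BumpCell(Row, Col: integer);
begin
  // one addressing step; no extra instruction is needed to fetch
  // the array's base address or combine the indices separately
  Inc(Grid[Row, Col]);
end;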
           
          While loops revisited
           
          An additional technique that can be applied to While 
            loops operating on arrays that saves on CPU registers is to shift 
            around all the references to the index variable so that you can count 
            from a negative number toward zero. This frees the register that would 
            have been needed to hold the iteration count. For example:  
          
  i := 0;
  while i < Count do
  begin
    Data[i] := Data[i] + 1;
    Inc(i);
  end;
          becomes:  
          
type
  // zero-length array trick: range checking must be off ($R-),
  // since the array is indexed with negative offsets
  TRef = array[0..0] of TheSameThingAsData;
  PRef = ^TRef;
var 
  Ref: PRef;
... 
  Ref := @Data[Count];  // point just past the last element
  i := -Count;  // Assign NEGATIVE count here
  while i < 0 do  // and count UP to zero
  begin
    Ref[i] := Ref[i] + 1;
    Inc(i);
  end;
           
          Pointer variables revisited
           
          In addition to the reduced dereferencing discussed above, using a 
            pointer variable can also serve to increase the "priority" 
            of an already existing pointer variable. In the example shown below, 
            taken from a Sub-String replacement routine, 
            the PChar variable pSub1 was being reloaded within the 
            loop (this could be seen in the CPU Window). By assigning it to pTemp 
            and then using pTemp within the loop, the loading was 
            shifted outside the loop, saving instruction cycles.  
          
pTemp := pSub1;  // increases "priority" of pSub1
while iStr[k] = pTemp[k] do
  Inc(k);
 
           
          Avoid checking method pointers with Assigned
           
          Checking method pointers for nil is a common operation typically 
            associated with calling events. Unfortunately, if assigned(FSomeEvent) 
            then ... does a 16bit compare of the high word of the code 
            address for the method pointer. This is rather odd and completely 
            unnecessary, and I can only guess that it is some sort of holdover 
            from 16bit Delphi 1. The workaround is to check the code address directly 
            (if assigned(TMethod(FSomeEvent).Code) then ...). This 
            is a bit ugly, so you may only want to use it in particularly 
            time critical sections. 
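          A short sketch of both forms (FSomeEvent as in the text; 
            the Self argument assumes a TNotifyEvent-style 
            handler):  
          
  // standard form: compares only the high word of the code address
  if Assigned(FSomeEvent) then
    FSomeEvent(Self);

  // workaround: tests the full code address directly
  if Assigned(TMethod(FSomeEvent).Code) then
    FSomeEvent(Self);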
           
          Controlling the size of enumerated types
           
          If you use enumerated types (such as TSuits=(Diamonds,Hearts,Clubs,Spades) 
            ) include the {$MinEnumSize 4} (or {$Z4}) 
            directive to force all enum variables to be 32bit. If you have compatibility 
            issues you can simply turn it on for the type declarations of interest. 
            For instance:  
          
type
{$Z4}
  TSuits=(hearts,clubs,diamonds,spades);
{$Z1}
          Use of this directive is especially important for enumerated 
            types with more than 256 elements. These otherwise result in word 
            sized variables, which are quite slow.  
           
          Virtual methods
           
          It should not be surprising that virtual methods incur more overhead 
            than static methods. Calling a virtual method requires two pointer 
            dereferences and an indirect call which, if the method has a couple of 
            parameters, approximately doubles the total call overhead. However, 
            there is the potential for much more severe penalties. The indirect 
            call can suffer from what amounts to branch misprediction, which is 
            a fairly stiff penalty on Pentium II processors. The penalty is incurred 
            each time the target of the call changes. Thus, calling a virtual method 
            within a loop where the method might change on every iteration could 
            incur a substantial number of penalties. The workaround is essentially 
            to sort the method calls.  
          Example: 
          
  TBaseClass=class
  public
    procedure VMethod; virtual;
    procedure SMethod;  
  end;
  TDerivedClass=class(TBaseClass)
    procedure VMethod; override;
  end;
  TDerived2Class=class(TBaseClass)
    procedure VMethod; override;
  end;
implementation
type 
  TArray=array[0..100] of TBaseClass;
 
procedure DoStuff;
var
  b: integer;
  j: integer;
  A:TArray;
begin
  A[0]:=TBaseClass.Create;
  b:=0;
  for j := 1 to 99 do    
  begin
    b:=(1+random(2)+b) mod 3;  // mix em up
    case b of    
      0: A[j]:=TBaseClass.Create;
      1: A[j]:=TDerivedClass.Create;
      2: A[j]:=TDerived2Class.Create;
    end;    
  end;
  for j := 0 to 99 do
    A[j].VMethod;
  for j := 0 to 99 do
    A[j].SMethod;
end;
           Sorting the calls is somewhat complicated; an example is shown below: 
           
          
Type
  TSomeVirtualMethod=procedure of object;
  TSomeMethodArray=array[0..100] of TSomeVirtualMethod;
var
  SomeMethodArray:TSomeMethodArray;
//    Initialization pass
  for i:=0 to Count-1 do
    SomeMethodArray[i]:=Item[i].SomeVirtualMethod;
//  Do something passes
  for i:=0 to Count-1 do
    SomeMethodArray[i];
           This, by itself, saves an underwhelming 1 clock cycle per call, 
            but you can sort the array by the code that's called (using TMethod) 
            to minimize the oh so painful branch prediction failure that can really 
            dominate this kind of method calling. Additionally, if the base class 
            method is some sort of do nothing routine it could be eliminated from 
            the procedure list entirely.  
          
// initialization pass
for i:=0 to Count-1 do
begin
  Hold:=ClassArray[i].SomeVirtualMethod;
  if TMethod(Hold).Code<>@TBaseClass.SomeVirtualMethod then
  begin
    j:=0;
    // find the insertion point, comparing code addresses
    while (j<ArrayCount) and
          (longint(TMethod(Hold).Code)<longint(TMethod(SomeMethodArray[j]).Code)) do
      inc(j);
    // shift the tail up one slot, working downward so nothing is overwritten
    for k:=ArrayCount-1 downto j do
      SomeMethodArray[k+1]:=SomeMethodArray[k];
    SomeMethodArray[j]:=Hold;
    inc(ArrayCount);
  end;
end;
           This obviously isn't for the faint of heart, and is only useful 
            in certain situations, but it could be a big time saver in cases where 
            there are many objects, but only a few versions of the method and 
            the method is relatively small or empty.  