Matrix-Vector multiplication JIT compiler

Committed to daNeuralNet a first working version of a JIT compiler for matrix-vector multiplication that relies on the FMA (Fused Multiply-Add) instruction set.

This version generates code that is up to twice as fast as OpenBLAS for matrices that fit in the CPU cache (usually 100×100 to 200×200), and it maintains a marginal lead for larger sizes, though those are bound by memory bandwidth. The performance profile is similar on both AMD and Intel CPUs.
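
For readers unfamiliar with the operation being accelerated, here is a minimal scalar sketch in C of a matrix-vector product written as a chain of multiply-add steps. It is not the daNeuralNet JIT or its generated code. The function name matvec_ref, the row-major layout and the single-precision type are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of daNeuralNet):
   computes out = matrix * vec for a row-major matrix with
   `rows` rows and `cols` columns. */
void matvec_ref(const float *matrix, const float *vec, float *out,
                size_t rows, size_t cols)
{
    assert(matrix != NULL && vec != NULL && out != NULL);
    for (size_t r = 0; r < rows; r++) {
        const float *row = matrix + r * cols;
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++) {
            /* One multiply and one add per element: this is the
               pattern an FMA instruction can perform as a single
               fused operation. */
            acc += row[c] * vec[c];
        }
        out[r] = acc;
    }
}
```

Each output element needs one multiply-add per matrix column and every matrix element is read exactly once, which is consistent with larger matrices being limited by memory bandwidth rather than by arithmetic.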


SamplingProfiler 64 – test version

A test version of SamplingProfiler 64bit is available here (3.2 MB).

It has only been tested with 64-bit binaries compiled with Delphi 10.3 and detailed map files. It should work with other Delphi versions, though TD32 and other debug information formats have not been tested yet.

There are also known issues with stack traces from DLLs, so it is rough around the edges, but it should be functional.