Matrix-Vector multiplication JIT compiler
Committed to daNeuralNet a first working version of a JIT compiler for matrix-vector multiplication that relies on the FMA (Fused Multiply-Add) instruction set.
This version generates code that is up to twice as fast as OpenBLAS for matrices that fit in the CPU cache (usually 100×100 to 200×200), and keeps a marginal lead for larger sizes, though those are bound by memory bandwidth. The performance profile is similar on both AMD and Intel CPUs.
A test version of SamplingProfiler 64bit is available
Just created a new repository with a “LibCBLAS” unit for calling the Windows 64-bit build of the OpenBLAS library from Delphi 10.3+