Delphi unit for OpenBLAS

Just created a new repository with a “LibCBLAS” unit, meant for using the 64-bit Windows build of the OpenBLAS library from Delphi 10.3+

https://bitbucket.org/egrange/libcblas/

OpenBLAS is an optimized BLAS (Basic Linear Algebra Subprograms) library. The DLL itself can be obtained from the “xianyi” repository, where pre-compiled Windows DLLs are maintained.

BLAS is a very low-level library, and OpenBLAS strives for pure performance by packaging many implementations tailored to specific CPU architectures.

From quick initial benchmarks, unless you have very fast RAM (or a server-class CPU), it may be preferable to force OpenBLAS to be single-threaded: the implementations are efficient enough that a single thread can already be memory-limited on a regular desktop/laptop CPU.

This can be accomplished either before loading the library with

SetEnvironmentVariable('OPENBLAS_NUM_THREADS', '1');

or dynamically with

cblas.openblas_set_num_threads(1);
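As a minimal sketch of the first option, the console program below sets the variable, reads it back, and only then loads the DLL. The DLL file name `libopenblas.dll` is an assumption (it depends on the build you downloaded), and the explicit `LoadLibrary` call only stands in for whatever loading the unit performs. If the DLL were bound at process startup instead, the variable would have to be set before the process starts.

```delphi
program SetThreadsBeforeLoad;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

const
  // Assumption: adjust to the actual file name of your OpenBLAS build
  OPENBLAS_DLL = 'libopenblas.dll';

var
  lib: HMODULE;
  err: Cardinal;
begin
  // Set the variable before the DLL is loaded, as described above
  if not SetEnvironmentVariable('OPENBLAS_NUM_THREADS', '1') then
  begin
    Writeln('SetEnvironmentVariable failed');
    Halt(1);
  end;

  // Read it back to confirm the process environment was updated
  Writeln('OPENBLAS_NUM_THREADS = ',
    System.SysUtils.GetEnvironmentVariable('OPENBLAS_NUM_THREADS'));

  // Stand-in for the library load (hypothetical file name, see above)
  lib := LoadLibrary(OPENBLAS_DLL);
  if lib = 0 then
  begin
    err := GetLastError;
    Writeln('Could not load ', OPENBLAS_DLL, ', error ', err);
  end
  else
  begin
    Writeln(OPENBLAS_DLL, ' loaded');
    FreeLibrary(lib);
  end;
end.
```

The dynamic call is the more flexible of the two, since it can be changed between computations, for instance when benchmarking.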

To illustrate the issue, below is a chart of the relative performance of single-threaded, dual-threaded and multi-threaded (as many cores as possible) execution on an E3-1240v6 with relatively fast memory. Chart values above 1 mean multi-threaded computation is faster than single-threaded for that matrix size, values below 1 mean multi-threaded is slower.

Here it is for cblas_sgemv (vector by matrix multiplication), for a square matrix. The switch to multi-core execution can result in a drastic drop due to threading overhead, which is only recouped for larger sizes, up to the point where the main memory bandwidth constraint ends up limiting everything.

The exact shape of the curve will depend on your particular problem, your particular CPU, how fast the RAM is, single vs dual channel, etc. So it is an aspect to keep in mind and benchmark.
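A rough sketch of such a benchmark follows. It takes the best of several runs and computes the same ratio as the chart (single-threaded time divided by multi-threaded time, so above 1 means multi-threaded is faster). The names `BestTimeMs`, `Speedup` and `TWorkload` are made up for this illustration, and the workload is a naive pure-Pascal matrix by vector product used as a placeholder. In a real measurement it would be replaced by the OpenBLAS call, with the thread count changed between the two timings.

```delphi
program BenchRatio;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics, System.TimeSpan;

type
  TWorkload = reference to procedure;  // hypothetical helper type

// Best-of-N wall-clock time in milliseconds (the minimum filters out noise)
function BestTimeMs(const work: TWorkload; runs: Integer): Double;
var
  i: Integer;
  sw: TStopwatch;
  t: Double;
begin
  Result := 0;
  for i := 1 to runs do
  begin
    sw := TStopwatch.StartNew;
    work();
    sw.Stop;
    t := sw.Elapsed.TotalMilliseconds;
    if (i = 1) or (t < Result) then
      Result := t;
  end;
end;

// Ratio as plotted in the chart: above 1 means multi-threaded is faster
function Speedup(singleMs, multiMs: Double): Double;
begin
  Result := singleMs / multiMs;
end;

const
  N = 512;  // square matrix size, vary this to trace the curve

var
  a, x, y: array of Single;  // a is N x N, stored row by row
  workload: TWorkload;
  tSingle, tMulti: Double;
  i: Integer;
begin
  SetLength(a, N * N);
  SetLength(x, N);
  SetLength(y, N);
  for i := 0 to High(a) do a[i] := 1;
  for i := 0 to High(x) do x[i] := 1;

  // Placeholder workload: naive y := A * x, NOT an OpenBLAS call
  workload :=
    procedure
    var
      r, c: Integer;
      sum: Single;
    begin
      for r := 0 to N - 1 do
      begin
        sum := 0;
        for c := 0 to N - 1 do
          sum := sum + a[r * N + c] * x[c];
        y[r] := sum;
      end;
    end;

  // Real use: cblas.openblas_set_num_threads(1) before this timing
  tSingle := BestTimeMs(workload, 20);
  // Real use: set the thread count to the number of cores before this timing
  tMulti := BestTimeMs(workload, 20);

  Writeln(Format('single: %.3f ms, multi: %.3f ms, ratio: %.2f',
    [tSingle, tMulti, Speedup(tSingle, tMulti)]));
end.
```

With the placeholder workload both timings measure the same code, so the ratio should hover around 1. It only becomes meaningful once the OpenBLAS call and the thread count changes are in place.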

Just do not assume that because OpenBLAS is highly optimized with architecture-specific implementations, you can just call it and forget all about it 🙂
