- Introduction
- Compiling and installing the library
- Math library reference
- DFT library reference
- Other tools included in the package
- Benchmark results
- Additional notes

Fig. 7.1 shows a simplified code of our dispatcher. There is only
one exported function **mainFunc**. When
**mainFunc** is called for the first
time, **dispatcherMain** is called internally,
since *funcPtr* is initialized to the pointer to
**dispatcherMain**(line 14). It then detects if the
CPU supports SSE 4.1(line 7), and
rewrites *funcPtr* to a pointer to the function
that utilizes SSE 4.1 or SSE 2, depending on the result of CPU
feature detection(line 10). When
**mainFunc** is called for the second time, it does
not execute the
**dispatcherMain**. It just executes the function
pointed by the pointer stored in *funcPtr* during
the execution of
**dispatcherMain**.

There are a few advantages in our dispatcher. The first advantage is that it does not require any compiler-specific extension. The second advantage is simplicity. There are only 18 lines of simple code. Since the dispatchers are completely separated for each function, there is not much room for bugs to get in.

The third advantage is low overhead. You might think that the overhead is one function call including execution of prologue and epilogue. However, since modern compilers eliminate redundant execution of the prologue, epilogue and return instruction, the actual overhead is just one jmp instruction. This is very fast since it is not conditional.

The fourth advantage is thread safety. There is only one variable
shared among threads, which is *funcPtr*. There are
only two possible values for this pointer variable. The first value
is the pointer to the **dispatcherMain**, and the
second value is the pointer to either **funcSSE2**
or **funcSSE4**, depending on the availability of
extensions. Once *funcPtr* is substituted with the
pointer to **funcSSE2**
or **funcSSE4**, it will not be changed in the
future. It is obvious that the code works in all the cases.

`static double (*funcPtr)(double arg);`

`static double dispatcherMain(double arg) {`

`double (*p)(double arg) = funcSSE2;`

`#if the compiler supports SSE4.1`

`if (SSE4.1 is available on the CPU) p = funcSSE4;`

`#endif`

`funcPtr = p;`

`return (*funcPtr)(arg);`

`}`

`static double (*funcPtr)(double arg) = dispatcherMain;`

`double mainFunc(double arg) {`

`return (*funcPtr)(arg);`

`}`

Fig. 7.1: Simplified code of our dispatcher

ULP stands for "unit in the last place", which is sometimes used for measuring accuracy of calculations. 1 ULP is basically the distance between the two closest floating point number, which depends on the exponent of the FP number. The accuracy of calculations by reputable math libraries is usually between 0.5 and 1 ULP. Here, the accuracy means the largest error of calculation, which only happens in the worst case. SLEEF math library provides multiple accuracy choices for some math functions. Many functions have 3.5-ULP and 1-ULP versions, and 3.5-ULP versions are significantly faster than 1-ULP versions. If you care more about execution speed than accuracy, it is advised to use the 3.5-ULP versions along with -ffast-math or "unsafe math optimization" options for the compiler.

In IEEE 754 standard, underflow does not happen abruptly when the exponent becomes zero. Instead, denormal numbers are produced which has less precision, and this is sometimes called gradual underflow. On some implementation which is not IEEE-754 conformant, flush-to-zero mode is used since it is easier to implement. In flush-to-zero mode, numbers smaller than the smallest normalized number cannot be represented, and it is replaced with zero. Because of this, the accuracy of calculation may be influenced in some cases. The smallest normalized precision number can be referred with DBL_MIN for double precision, and FLT_MIN for single precision. The naming of these macros is a little bit confusing because DBL_MIN is not the smallest double precision number.

The sincospi series of functions evaluates *sin(
π a )* and

C standards include specifications for functions that evaluate
trigonometric functions. In order to do calculations for evaluating
these functions, reduction of an argument is required. This involves
a multiple precision multiplication with *π*,
which requires many operations of addition and multiplication. This
is slow especially if accurate evaluation is required. By designing
the function in a way that the argument is pre-multiplied
by *π*, this reduction can be eliminated. This
leads to faster and more accurate evaluation.

The second reason is that sincospi functions are handy for
implementing an FFT library. FFT libraries need to evaluate
trigonometric functions for generating twiddle factors that is used in
the butterfly operations. Since the butterfly operations are
repeatedly applied, the error in twiddle factors accumulates. Thus, we
want to make the error in twiddle factors as small as possible. In an
FFT of power-of-two size, twiddle factors are
*sin( π m /
2^{n} )* where

The third reason is that sinpi is needed internally to implement gamma functions.

It is a soup ladle.