SLEEF Documentation - Additional Notes

Table of contents

Additional Notes

How the dispatcher works

Fig. 7.1 shows a simplified code of our dispatcher. There is only one exported function mainFunc. When mainFunc is called for the first time, dispatcherMain is called internally, since funcPtr is initialized to the pointer to dispatcherMain(line 14). It then detects if the CPU supports SSE 4.1(line 7), and rewrites funcPtr to a pointer to the function that utilizes SSE 4.1 or SSE 2, depending on the result of CPU feature detection(line 10). When mainFunc is called for the second time, it does not execute the dispatcherMain. It just executes the function pointed by the pointer stored in funcPtr during the execution of dispatcherMain.

There are a few advantages in our dispatcher. The first advantage is that it does not require any compiler-specific extension. The second advantage is simplicity. There are only 18 lines of simple code. Since the dispatchers are completely separated for each function, there is not much room for bugs to get in.

The third advantage is low overhead. You might think that the overhead is one function call including execution of prologue and epilogue. However, since modern compilers eliminate redundant execution of the prologue, epilogue and return instruction, the actual overhead is just one jmp instruction. This is very fast since it is not conditional.

The fourth advantage is thread safety. There is only one variable shared among threads, which is funcPtr. There are only two possible values for this pointer variable. The first value is the pointer to the dispatcherMain, and the second value is the pointer to either funcSSE2 or funcSSE4, depending on the availability of extensions. Once funcPtr is substituted with the pointer to funcSSE2 or funcSSE4, it will not be changed in the future. It is obvious that the code works in all the cases.

static double (*funcPtr)(double arg);

static double dispatcherMain(double arg) {
    double (*p)(double arg) = funcSSE2;

#if the compiler supports SSE4.1
    if (SSE4.1 is available on the CPU) p = funcSSE4;
#endif

    funcPtr = p;
    return (*funcPtr)(arg);
}

static double (*funcPtr)(double arg) = dispatcherMain;

double mainFunc(double arg) {
    return (*funcPtr)(arg);
}

Fig. 7.1: Simplified code of our dispatcher

ULP, gradual underflow and flush-to-zero mode

ULP stands for "unit in the last place", which is sometimes used for measuring accuracy of calculations. 1 ULP is basically the distance between the two closest floating point number, which depends on the exponent of the FP number. The accuracy of calculations by reputable math libraries is usually between 0.5 and 1 ULP. Here, the accuracy means the largest error of calculation, which only happens in the worst case. SLEEF math library provides multiple accuracy choices for some math functions. Many functions have 3.5-ULP and 1-ULP versions, and 3.5-ULP versions are significantly faster than 1-ULP versions. If you care more about execution speed than accuracy, it is advised to use the 3.5-ULP versions along with -ffast-math or "unsafe math optimization" options for the compiler.

In IEEE 754 standard, underflow does not happen abruptly when the exponent becomes zero. Instead, denormal numbers are produced which has less precision, and this is sometimes called gradual underflow. On some implementation which is not IEEE-754 conformant, flush-to-zero mode is used since it is easier to implement. In flush-to-zero mode, numbers smaller than the smallest normalized number cannot be represented, and it is replaced with zero. Because of this, the accuracy of calculation may be influenced in some cases. The smallest normalized precision number can be referred with DBL_MIN for double precision, and FLT_MIN for single precision. The naming of these macros is a little bit confusing because DBL_MIN is not the smallest double precision number.

About sincospi

The sincospi series of functions evaluates sin( πa ) and cos( πa ) simultaneously. These functions are added to SLEEF as of version 3.0. There are a few reasons that I added these functions.

C standards include specifications for functions that evaluate trigonometric functions. In order to do calculations for evaluating these functions, reduction of an argument is required. This involves a multiple precision multiplication with π, which requires many operations of addition and multiplication. This is slow especially if accurate evaluation is required. By designing the function in a way that the argument is pre-multiplied by π, this reduction can be eliminated. This leads to faster and more accurate evaluation.

The second reason is that sincospi functions are handy for implementing an FFT library. FFT libraries need to evaluate trigonometric functions for generating twiddle factors that is used in the butterfly operations. Since the butterfly operations are repeatedly applied, the error in twiddle factors accumulates. Thus, we want to make the error in twiddle factors as small as possible. In an FFT of power-of-two size, twiddle factors are sin( πm / 2n ) where m and n are integer. If we just use the usual trigonometric functions defined in the C standards with the precision same as that used for butterfly operations, we already have error when calculating arguments, since πm / 2n cannot be represented as a floating point value without error. On the other hand, if we use sincospi function, the argument can be accurately represented by a radix 2 FP number. Thus, we can calculate twiddle factors with better accuracy.

The third reason is that sinpi is needed internally to implement gamma functions.

About the logo

It is a soup ladle.


logo
Fig. 7.2: SLEEF logo