F107 and F161 High-Precision Floating-Point Data Types

This page describes my work on algorithms for extended-precision floating point.

Contents

Introduction

The Knuth Relation

Why 107 Bits and not 106=2×53 ?

Building Blocks for the 107-bit Operations

Source Code

Bibliography

Introduction

The names "f107" and "f161" in the title are the typedef names I have chosen for these types. They reflect the amount of precision one expects to get from the datatypes — 107 and 161 bits of mantissa respectively. For consistency I also declare typedefs "f24" and "f53" for the standard IEEE single and double.

In order to communicate clearly, the symbols ⊕, ⊖ and ⊗ will be used to represent the result (after rounding) of a machine's built-in hardware floating-point add, subtract and multiply operations. When +, - and × appear without a circle, I am referring to a normal exact operation. This use of symbols will probably be a bit easier for mathematicians than for computer programmers. (-:

The techniques described here have been in use at least as far back as 1971, when Dekker described them in a paper[1] that has been widely cited by later authors.

The core insight of the technique is that native hardware-supported operations can be used to quickly compute the exact value of error resulting from roundoff in an addition (or subtraction) operation. A corollary to this insight is that a similar technique can be used to slice and combine the digits of partial sums and products, retaining any desired number of the most significant digits and discarding the rest.

The technique is efficient enough to use on anything that already has some kind of floating-point, including the tiny CPU cores in a massively parallel GPU[13].

Overlap

The term overlap is used to refer to any digit(s) that have the same significance. For example, consider the addition 123+4.56. the digit "3" in 123 overlaps the digit "4" in 4.56, because both have a significance of 10⁰. The same thing can also occur in binary digits.

The Knuth Relation

The method is expressed concisely by Knuth([6], section 4.2.2, theorem C) with this formula:

u + v = (u ⊕ v) + ((u ⊖ u') ⊕ (v ⊖ v'')
where
u' = (u ⊕ v) ⊖ v
v'' = (u ⊕ v) ⊖ u'

This formula states that the sum on the left is always exactly equal to the one on the right. Each side of the equation has exactly one precise unrounded sum "+"; the right side also has several rounded sum and difference operations. In addition, there is no overlap in the two pieces being added on the right, although there might be overlap on the left. (I call this the Knuth Relation not because Knuth gets credit for inventing it, but because he is cited by many others in papers and in computer source code.)

A deconstruction, with an example, is useful to illustrate what is happening. For the example, imagine that we are using a decimal calculator that displays only 4 digits, and all calculations are rounded to 4 digits. Let u be 999.1 and v be 2.222. There are two digits of overlap; the sum would take six digits (4 plus 4 minus the two digits that overlap) but because of carry propagation the exact sum (which is 1001.322) requires seven.

u ⊕ v, the rounded sum produced by the hardware (In our example, this will be 1001)

u', the twice-rounded result of adding and then subtracting v from u. (In the example this will be 999.0)

v'', the amount of v that is included in the rounded sum u ⊕ v. (In the example, this is 2.000)

u ⊖ u', the amount of u that was lost in the calculation of u ⊕ v. (In our example, this is .1000)

v ⊖ v'', the amount of v that was lost in the calculation of u ⊕ v. (In our example, this is .2220)

((u ⊖ u') ⊕ (v ⊖ v''), the exact amount of the roundoff that occurred in the calculation of u ⊕ v. (In our example, this is .3220)

Together, the two values (u ⊕ v) and ((u ⊖ u') ⊕ (v ⊖ v'') have the same exact sum as u + v. Furthermore, these two parts have no "digits of overlap". They might have opposite sign.

Most of the cited authors prove that the formula correctly produces an exact result in all cases. There are three variations that need to be considered (depending on the sign of the roundoff and whether or not u is equal to u'), and I will leave it to the reader to discover the proof or consult the references.

Why 107 Bits and not 106=2×53 ?

The standard IEEE 754 64-bit floating-point type provides 53 bits of precision in the mantissa. When building a doubled-precision addition operation using the relation just described, we get 107 bits of mantissa precision — one bit more than 2×53. To some this may be a surprise — where does the extra bit come from?

The extra bit comes from the sign of the lower-precision component. To illustrate how it works, imagine that the upper component is limited to being one of the four values (4,5,6,7), and the lower component has four possible values plus a sign, or can be zero: (-4/8, -3/8, -2/8, -1/8, 0, 1/8, 2/8, 3/8, 4/8).

4 - 4/8
...
5 + 1/8
5 + 2/8
5 + 3/8
5 + 4/8
6 - 3/8
6 - 2/8
6 - 1/8
6 + 0
6 + 1/8
6 + 2/8
...
7 + 4/8

Counting up all the combinations, there are 33≈2⁵ values that can be expressed. The sign bit contributes half of these values; without it the fractional part would have to go in increments of 1/4 and we would only get 16 or 17 values.

Building Blocks for the 107-bit Operations

TWO-SUM and add_112

A practical application of the above formula is a set of operations that will take two single-precision quantities that might or might not overlap, and produce two single-precision quantities that do not overlap and whose precise sum is equal to the precise sum of the original two quantities.

This requires six rounded operations, and is called TWO-SUM in most of the literature. I call it add_112 for consistency with the names of other routines described below.

// add two doubles, return a two-component (f107) result.
add_112(a: f53; b: f53)
t0 = a ⊕ b
v = t0 ⊖ a
t1 = (b ⊖ v) ⊕ (a ⊖ (t0 ⊖ v))
return(t0, t1)

QUICK-TWO-SUM and normalize_22

The above formula works for any pair of addends a and b, regardless of which is larger. If you already know which is larger, you can compute the result using only three operations instead of six. For reasons that will be more clear below, I call it normalize_22.

// renormalize two components, with two-component (f107) output.
normalize_22(t0: f53; t1: f53)
s0 = t0 ⊕ t1
s1 = t1 ⊖ (s0 ⊖ t0)
return(s0, s1)

SPLIT and split27

SPLIT is another building block, used for multiplication. It is described by Priest[4],[5], and proven by Shewchuk[7]. It is possible to split a value into two parts that have less than or equal to half the precision, even when the original type has an odd number of bits — because the lower half can be of opposite sign.

// split an f53 into two components hi and lo, each with 26 digits of
// precision (note that about half of the time, hi and lo will be of
// opposite sign)
split27(x: f53)
y = (2²⁷+1) ⊗ x
z = y ⊖ x
hi = y ⊖ z
lo = x ⊖ hi
return(hi, lo)

The size of the two parts hi and lo will be determined by the constant used in the first calculation. In general, it should be 2^s+1 where s is at least half of the precision (in other words, divide the precision by 2, and round up if the precision is odd). For f53, s is 27 as shown above. For f24 (IEEE single) s should be 12; the constant is 2¹²+1, hi will have 12 bits and lo will have 11. For f64 (the IEEE extended or "long double" type supported by Pentium-compatible processors) s should be 32, hi will have 32 bits and lo will have 31.

Source Code

My f107 implementation in C++ includes:

the basic operations (+, -, ×, /)
faster versions of x², 2^N and x^N for integer N
absolute value, floor
square root
sine, cosine, tangent
conversion to/from ASCII string
exp, log, sinh, cosh, tanh

The source files are:

f107_o.cpp
f107_o.h

Higher Precisions

Here is source code for my f161 "triple-double" and f215 "quad-double" formats. These only have the four basic arithmetic functions (add, subtract, multiply, divide), plus an x² "square" function that is more efficient than x*x, and integer exponent. For the work that I do (drawing images of the Mandelbrot set), these are the only functions I have needed at these higher precisions.

f161_o.cpp
f161_o.h
f215_o.cpp
f215_o.h

Bibliography

[1] Dekker, T. (1971) A floating-point technique for extending the available precision. In Numerische Mathematik 18, 224-242.

[2] Wyatt, W.T., et al. (1976) A portable extended precision arithmetic package and library with Fortran precompiler. ACM Trans. Math. Softw. 2(3), 209-231.

[3] Brent, R. (1978) A Fortran multiple-precision arithmetic package. ACM Trans. Math. Softw. 4(1), 57-70.

[4] Priest (1991) Algorithms for Arbitrary Precision Floating Point Arithmetic. Proc. 10th Symposium on Computer Arithmetic, IEEE Computer Society Press, Caif., 1991.

[5] Priest (1992) On Properties of Floating Point Arithmetics: Numerical Stability and the Cost of Accurate Computations.

[6] Knuth, D.E., The art of computer programming, vol. 2. Seminumerical algorithms. (Addison Wesley Longman, 1997 edition)

See in particular Theorems B and C and exercise 21 in section 4.2.2 "Accuracy of Floating Point Arithmetic".

[7] Shewchuk (1997) Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates. Discrete and Computational Geometry 18(3), 305-363.

[8] Briggs, K. (1998) Doubledouble floating point arithmetic, web pages with source code and documentation. Originally at http://keithbriggs.info/software.html, now no longer online, but a mirror of the original 1998 version is at: The doubledouble homepage

[9] Li, X., et al. (2000). Design, implementation and testing of extended and mixed precision BLAS. Lawrence Berkeley National Laboratory. Technical Report LBNL-45991

[10] Hida, Y., et al. (2000) Quad-double arithmetic: algorithms, implementation, and application. Lawrence Berkeley National Laboratory, Technical Report LBNL-46996.

[11] Hida, Y., et al. (2007) Library for double-double and quad-double arithmetic Similar to the 2000 paper, but removes the description of how to compile and test the library on different systems. DD and QD implementation is available for download.

[12] Andrew Thall (2006). Extended-Precision Floating-Point Numbers for GPU Computation. one-page layout summary of [13] (for SIGGRAPH 2006) showing Mandelbrot calculations rendered in 48-bit precision on a GPU that only provides 23-bit precision. PDF here

[13] Andrew Thall (2007, amended 2009). Extended-Precision Floating-Point Numbers for GPU Computation. Adapts the double and quad algorithms from Shewchuk, Hida et al. to the Nvidia Cg language for GPUs (achieving roughly 96 bits of mantissa precision). Includes high-precision sqrt, sin, cos, ln and exp functions. PDF here

This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2023 Apr 19.

s.27