| F107 and F161 High-Precision Floating-Point Data Types |
This page describes my work on algorithms for extended-precision floating point.
The names "f107" and "f161" in the title bar are the typedef names I have chosen for these types. They reflect the amount of precision one expects to get from the datatypes 107 and 161 bits of mantissa respectively. For consistency I also declare typedefs "f24" and "f53" for the standard IEEE single and double.
In order to communicate clearly, the symbols ⊕, ⊖ and ⊗ will be used to represent the result (after rounding) of a machine's built-in hardware floating-point add, subtract and multiply operations. When +, - and × appear without a circle, I am referring to a normal exact operation.
The techniques described here have been in use at least as far back as 1971, when Dekker described them in his paper1 that has been widely cited by later authors.
The core insight of the technique is that native hardware-supported operations can be used to quickly compute the exact value of error resulting from roundoff in an addition (or subtraction) operation. A corrolary to this insight is that a similar technique can be used to slice and combine the digits of partial sums and products, retaining any desired number of the most significant digits and discarding the rest.
Overlap
The term overlap is used to refer to any digit(s) that have the same significance. For example, the digit "3" in 123 overlaps the digit "1" in 1.23, because both have a significance of 100.
The Knuth Relation
The method is expressed concisely by Knuth7 with this formula:
u + v = (u ⊕ v) + ((u ⊖ u') ⊕ (v ⊖ v'')
where
u' = (u ⊕ v) ⊖ v
v'' = (u ⊕ v) ⊖ u'
This formula states that the sum on the left is always exactly equal to the one on the right. Each side of the equation has exactly one precise unrounded sum "+"; the right side also has several rounded sum and difference operations. In addition, there is no overlap in the two pieces being added on the right, although there might be overlap on the left. (I call this the Knuth Relation not because Knuth gets credit for inventing it, but because he is cited by many others in papers and in computer source code.)
A deconstruction, with an example, is useful to illustrate what is happening. For the example, imagine that we are using a decimal calculator that displays only 4 digits, and all calculations are rounded to 4 digits. Let u be 999.1 and v be 2.222. There are two digits of overlap; the sum would take six digits (4 plus 4 minus the two digits that overlap) but because of carry propagation the exact sum (which is 1001.322) requires seven.
u ⊕ v, the rounded sum produced by the hardware (In our example, this will be 1001)
u', the twice-rounded result of adding and then subtracting v from u. (In the example this will be 999.0)
v'', the amount of v that is included in the rounded sum u ⊕ v. (In the example, this is 2.000)
u ⊖ u', the amount of u that was lost in the calculation of u ⊕ v. (In our example, this is .1000)
v ⊖ v'', the amount of v that was lost in the calculation of u ⊕ v. (In our example, this is .2220)
((u ⊖ u') ⊕ (v ⊖ v''), the exact amount of the roundoff that occurred in the calculation of u ⊕ v. (In our example, this is .3220)
Together, the two values (u ⊕ v) and ((u ⊖ u') ⊕ (v ⊖ v'') have the same exact sum as u + v.
Most of the cited authors prove that the formula correctly produces an exact result in all cases. There are three variations that need to be considered (depending on the sign of the roundoff and whether or not u is equal to u'), and I will leave it to the reader to discover the proof or consult the references.
TWO-SUM and add_112
A practical application of the above formula is a set of operations that will take two single-precision quantities that might or might not overlap, and produce two single-precision quantities that do not overlap and whose precise sum is equal to the precise sum of the original two quantities.
This requires six rounded operations, and is called TWO-SUM in most of the literature. I call it add_112 for consistency with the names of other routines described below.
// add two doubles, return a two-component (f107) result.
add_112(a: f53; b: f53)
t0 = a ⊕ b
v = t0 ⊖ a
t1 = (b ⊖ v) ⊕ (a ⊖ (t0 ⊖ v))
return(t0, t1)
QUICK-TWO-SUM and normalize_22
The above formula works for any pair of addends a and b, regardless of which is larger. If you already know which is larger, you can compute the result using only three operations instead of six. For reasons that will be more clear below, I call it normalize_22.
// renormalize two components, with two-component (f107) output.
normalize_22(t0: f53; t1: f53)
s0 = t0 ⊕ t1
s1 = t1 ⊖ (s0 ⊖ t0)
return(s0, s1)
SPLIT and split27
SPLIT is another building block, used for multiplication. It is described by Priest7,8, and proven by Shewchuk9. It is possible to split a value into two parts that have less than or equal to half the precision, even when the original type has an odd number of bits because the lower half can be of opposite sign.
// split an f53 into two components hi and lo, each with 26 digits of
// precision (note that about half of the time, hi and lo will be of
// opposite sign)
split27(x: f53)
y = (227+1) ⊗ x
z = y ⊖ x
hi = y ⊖ z
lo = x ⊖ hi
return(hi, lo)
The size of the two parts hi and lo will be determined by the constant used in the first calculation. In general, it should be 2s+1 where s is half of the precision (rounded up). For f53, s is 27 as shown above. For f24 (IEEE single) s should be 12; the constant is 212+1, hi will have 12 bits and lo will have 11. For f64 (the IEEE extended or "long double" type supported by Pentium-compatible processors) s should be 32, hi will have 32 bits and lo will have 31.
1 :
Dekker, T. (1971) A floating-point technique for extending
the available precision. In Numerische Mathematik 18, 224-242.
2 :
Wyatt, W.T., et. al. (1976) A portable
extended precision arithmetic package and library with Fortran
precompiler. ACM Trans. Math. Softw. 2(3), 209-231.
3 :
Brent, R. (1978) A Fortran multiple-precision arithmetic
package. ACM Trans. Math. Softw. 4(1), 57-70.
4 :
Briggs, K. (1998) Doubledouble floating point arithmetic.
http://keithbriggs.info/software.html
5 :
Li, X., et. al. (2000). Design, implementation and testing
of extended and mixed precision BLAS. Lawrence Berkeley National
Laboratory. Technical Report LBNL-45991
6 :
Hida, Y., et. al. (2001) Quad-double arithmetic:
algorithms, implementation, and application. Lawrence Berkeley
National Laboratory, Technical Report LBNL-46996.
7 :
Priest (1991) Algorithms for Arbitrary Precision Floating
Point Arithmetic. Proc. 10th Symposium on Computer Arithmetic, IEEE
Computer Society Press, Caif., 1991.
8 :
Priest (1992) On Properties of Floating Point Arithmetics:
Numerical Stability and the Cost of Accurate Computations.
9 :
Shewchuk (1997) Adaptive Precision Floating-Point Arithmetic
and Fast Robust Geometric Predicates. Discrete and Computational
Geometry 18(3), 305-363.
© 1996-2008 Robert P. Munafo.
Email the author
This work is licensed under a
Creative Commons Attribution 2.5 License.
Back to my main page
s.13