This page describes my work on algorithms for extended-precision floating point.
The names "f107" and "f161" in the title are the typedef names I have chosen for these types. They reflect the amount of precision one expects to get from the datatypes 107 and 161 bits of mantissa respectively. For consistency I also declare typedefs "f24" and "f53" for the standard IEEE single and double.
In order to communicate clearly, the symbols ⊕, ⊖ and ⊗ will be used to represent the result (after rounding) of a machine's built-in hardware floating-point add, subtract and multiply operations. When +, - and × appear without a circle, I am referring to a normal exact operation.
The techniques described here have been in use at least as far back as 1971, when Dekker described them in his paper[1] that has been widely cited by later authors.
The core insight of the technique is that native hardware-supported operations can be used to quickly compute the exact value of error resulting from roundoff in an addition (or subtraction) operation. A corrolary to this insight is that a similar technique can be used to slice and combine the digits of partial sums and products, retaining any desired number of the most significant digits and discarding the rest.
The term overlap is used to refer to any digit(s) that have the same significance. For example, consider the addition 123+4.56. the digit "3" in 123 overlaps the digit "4" in 4.56, because both have a significance of 100. The same thing can also occur in binary digits.
The method is expressed concisely by Knuth([6], section 4.2.2, theorem C) with this formula:
u + v = (u ⊕ v) + ((u ⊖ u') ⊕ (v ⊖ v'')
where
u' = (u ⊕ v) ⊖ v
v'' = (u ⊕ v) ⊖ u'
This formula states that the sum on the left is always exactly equal to the one on the right. Each side of the equation has exactly one precise unrounded sum "+"; the right side also has several rounded sum and difference operations. In addition, there is no overlap in the two pieces being added on the right, although there might be overlap on the left. (I call this the Knuth Relation not because Knuth gets credit for inventing it, but because he is cited by many others in papers and in computer source code.)
A deconstruction, with an example, is useful to illustrate what is happening. For the example, imagine that we are using a decimal calculator that displays only 4 digits, and all calculations are rounded to 4 digits. Let u be 999.1 and v be 2.222. There are two digits of overlap; the sum would take six digits (4 plus 4 minus the two digits that overlap) but because of carry propagation the exact sum (which is 1001.322) requires seven.
u ⊕ v, the rounded sum produced by the hardware (In our example, this will be 1001)
u', the twice-rounded result of adding and then subtracting v from u. (In the example this will be 999.0)
v'', the amount of v that is included in the rounded sum u ⊕ v. (In the example, this is 2.000)
u ⊖ u', the amount of u that was lost in the calculation of u ⊕ v. (In our example, this is .1000)
v ⊖ v'', the amount of v that was lost in the calculation of u ⊕ v. (In our example, this is .2220)
((u ⊖ u') ⊕ (v ⊖ v''), the exact amount of the roundoff that occurred in the calculation of u ⊕ v. (In our example, this is .3220)
Together, the two values (u ⊕ v) and ((u ⊖ u') ⊕ (v ⊖ v'') have the same exact sum as u + v.
Most of the cited authors prove that the formula correctly produces an exact result in all cases. There are three variations that need to be considered (depending on the sign of the roundoff and whether or not u is equal to u'), and I will leave it to the reader to discover the proof or consult the references.
The standard IEEE 754 64-bit floating-point type provides 53 bits of precision in the mantissa. When building a doubled-precision addition operation using the relation just described, we get 107 bits of mantissa precision one bit more than 2×53. To some this may be a surprise where does the extra bit come from?
The extra bit comes from the sign of the lower-precision component. To illustrate how it works, imagine that the upper component is limited to being one of the four values (4,5,6,7), and the lower component has four possible values plus a sign, or can be zero: (-4/8, -3/8, -2/8, -1/8, 0, 1/8, 2/8, 3/8, 4/8).
4 - 4/8
...
5 + 1/8
5 + 2/8
5 + 3/8
5 + 4/8
6 - 3/8
6 - 2/8
6 - 1/8
6 + 0
6 + 1/8
6 + 2/8
...
7 + 4/8
Counting up all the combinations, there are 33≅25 values that can be expressed. The sign bit contributes half of these values; without it the fractional part would have to go in increments of 1/4 and we would only get 16 or 17 values.
A practical application of the above formula is a set of operations that will take two single-precision quantities that might or might not overlap, and produce two single-precision quantities that do not overlap and whose precise sum is equal to the precise sum of the original two quantities.
This requires six rounded operations, and is called TWO-SUM in most of the literature. I call it add_112 for consistency with the names of other routines described below.
// add two doubles, return a two-component (f107) result.
add_112(a: f53; b: f53)
t0 = a ⊕ b
v = t0 ⊖ a
t1 = (b ⊖ v) ⊕ (a ⊖ (t0 ⊖ v))
return(t0, t1)
The above formula works for any pair of addends a and b, regardless of which is larger. If you already know which is larger, you can compute the result using only three operations instead of six. For reasons that will be more clear below, I call it normalize_22.
// renormalize two components, with two-component (f107) output.
normalize_22(t0: f53; t1: f53)
s0 = t0 ⊕ t1
s1 = t1 ⊖ (s0 ⊖ t0)
return(s0, s1)
SPLIT is another building block, used for multiplication. It is described by Priest[4],[5], and proven by Shewchuk[7]. It is possible to split a value into two parts that have less than or equal to half the precision, even when the original type has an odd number of bits because the lower half can be of opposite sign.
// split an f53 into two components hi and lo, each with 26 digits of
// precision (note that about half of the time, hi and lo will be of
// opposite sign)
split27(x: f53)
y = (227+1) ⊗ x
z = y ⊖ x
hi = y ⊖ z
lo = x ⊖ hi
return(hi, lo)
The size of the two parts hi and lo will be determined by the constant used in the first calculation. In general, it should be 2s+1 where s is half of the precision (rounded up). For f53, s is 27 as shown above. For f24 (IEEE single) s should be 12; the constant is 212+1, hi will have 12 bits and lo will have 11. For f64 (the IEEE extended or "long double" type supported by Pentium-compatible processors) s should be 32, hi will have 32 bits and lo will have 31.
References
[1] Dekker, T. (1971) A floating-point technique for extending the available precision. In Numerische Mathematik 18, 224-242.
[2] Wyatt, W.T., et. al. (1976) A portable extended precision arithmetic package and library with Fortran precompiler. ACM Trans. Math. Softw. 2(3), 209-231.
[3] Brent, R. (1978) A Fortran multiple-precision arithmetic package. ACM Trans. Math. Softw. 4(1), 57-70.
[4] Priest (1991) Algorithms for Arbitrary Precision Floating Point Arithmetic. Proc. 10th Symposium on Computer Arithmetic, IEEE Computer Society Press, Caif., 1991.
[5] Priest (1992) On Properties of Floating Point Arithmetics: Numerical Stability and the Cost of Accurate Computations.
[6] Knuth, D.E., The art of computer programming, vol. 2. Seminumerical algorithms. (Addison Wesley Longman, 1997 edition)
[7] Shewchuk (1997) Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates. Discrete and Computational Geometry 18(3), 305-363.
[8] Briggs, K. (1998) Doubledouble floating point arithmetic, web pages with source code and documentation. Originally at http://keithbriggs.info/software.html, now no longer online, but a mirror of the original 1998 version is at: The doubledouble homepage
[9] Li, X., et. al. (2000). Design, implementation and testing of extended and mixed precision BLAS. Lawrence Berkeley National Laboratory. Technical Report LBNL-45991
[10] Hida, Y., et. al. (2000) Quad-double arithmetic: algorithms, implementation, and application. Lawrence Berkeley National Laboratory, Technical Report LBNL-46996.
[11] Hida, Y., et. al. (2007) Library for double-double and quad-double arithmetic Similar to the 2000 paper, but removes the description of how to compile and test the library on different systems.
Robert Munafo's home pages on HostMDS (c) 1996-2010 Robert P. Munafo. about contact
This work is licensed under a Creative Commons Attribution 2.5
License. Details here
s.13