
Survey of Floating-Point Formats    

This page gives a very brief summary of floating-point formats that have been used over the years. Most have been implemented in hardware and/or software and used for real work; a few (notably the small ones at the beginning) just for lecture and homework examples. They are listed in order of increasing range (a function of exponent size) rather than by precision or chronologically.

range (overflow value) | precision | bits | B | We | Wm | what
14 | 0.6 | 6 | 2 | 3 | 2 | Used in university courses [21,22]
240 | 0.9 | 8 | 2 | 4 | 3 | Used in university courses [21,22]
65504 = 2^15×(2-2^-10) | 3.3 | 16 | 2 | 5 | 10 | IEEE 754-2008 binary16, also called "half", "s10e5", "fp16". 2-byte, excess-15 exponent used in nVidia NV3x and subsequent GPUs [24,25,27,33]; largest minifloat. Can approximate any 16-bit unsigned integer or its reciprocal to 3 decimal places.
9.9999×10^8 (see note [32]) | 4.8 | 24 | 2 | 7 | 16 | Zuse Z1, the first-ever implementation of binary floating-point [31]
9.999×10^9 (see note [32]) | 4.8 | 24 | 2 | 7 | 16 | Zuse Z3
2.81×10^14 | 3.6 | 18 | 8 | 5 | 12 | excess-15 octal, 4-digit mantissa. A fairly decent radix-8 format in an 18-bit PDP-10 halfword
9.22×10^18 = 2^(2^6-1) | 4.8 | 24 | 2 | 7 | 16 | 3-byte excess-63 ATI R3x0 and Rv350 GPUs [20,33]. Also called "s16e7", "fp24".
1.84×10^19 = 2^(2^6) | 6.9 | 30 | 2 | 7 | 23 | AMD 9511 (1979) [5]
9.90×10^27 = 8^(8^2/2-1) | 5.1 | 24 | 8 | 6 | 17 | Octal excess-32 [12]
1.70×10^38 = 2^(2^7-1) | 7.2 | 32 | 2 | 8 | 1+23 | Digital VAX F format, TRS-80 single-precision [35]
1.70×10^38 = 2^(2^7-1) | 8.1 | 36 | 2 | 8 | 27 | Digital PDP-10 [1,18]; Honeywell 600, 6000 [1,16]; Univac 110x single [1]; IBM 709x, 704x [1]
1.70×10^38 = 2^(2^7-1) | 9.6 | 40 | 2 | 8 | 1+31 | Apple II, Sinclair ZX Spectrum [30], perhaps others
1.70×10^38 = 2^(2^7-1) | 16.9 | 64 | 2 | 8 | 1+55 | Digital VAX D format, TRS-80 double-precision [35]
3.40×10^38 = 2^(2^7) | 7.2 | 32 | 2 | 8 | 1+23 | IEEE 754 single-precision (or IEEE 754-2008 "binary32") (ubiquitous)
3.40×10^38 = 2^(2^7) | 7.2 | 32 | 2 | 8 | 1+23 | Digital PDP-11 [19], PDP 16 [6]
9.99×10^49 = 10^(10^2/2) | 8.0 | 44 | 10 | 2d | 8d | Burroughs B220 [7]
4.31×10^68 = 8^76 | 11.7 | 47 | 8 | 7 | 39 | Burroughs 5700, 6700, 7700 single [1,14,16,17]
7.24×10^75 = 16^63 | 7.2 | 32 | 16 | 7 | 24 | IBM 360, 370 [6]; Amdahl [1]; DG Eclipse M/600 [1]
7.24×10^75 = 16^63 | 16.8 | 64 | 16 | 7 | 56 | IBM 360 double [15]
5.79×10^76 = 2^255 | 7.2 | ? | 2 | 9 | 24 | Burroughs 1700 single [16]
1.16×10^77 = 16^64 | 7.2 | 32 | 16 | 7 | 24 | HP 3000 [1]
9.99×10^96 = 10^(3×2^5+1) | 7.0 | 32 | 10 | 8- | 7d | IEEE 754r decimal32 [3,4]
9.99×10^99 = 10^(10^2) | 10.0 | ? | 10 | 2d | 10d | Most scientific calculators
4.9×10^114 = 8^127 | 12.0 | 48 | 8 | 8 | 40 | Burroughs 7700 [6]
9.99×10^127 = 100^64 | ~13 | 64 | 100 | 1 | 7 | TI-99/4A computer [36]
8.9×10^307 = 2^(2^10-1) | 14.7 | 60 | 2 | 11 | 1+48 | CDC 6000, 6600 [6], 7000 CYBER
8.9×10^307 = 2^(2^10-1) | 15.9 | 64 | 2 | 11 | 1+52 | DEC VAX G format
8.9×10^307 = 2^(2^10-1) | ? | ? | ? | ? | ? | UNIVAC 110x double [1]
1.8×10^308 = 2^(2^10) | 15.9 | 64 | 2 | 11 | 1+52 | IEEE 754 double-precision (or IEEE 754-2008 "binary64") (nearly ubiquitous).
1.8×10^308 = 2^(2^10) | 32.2 | 128 | 2 | 11 | 107 | "Double-double" [40,41] based on IEEE binary64 (my "f107" [39])
1.8×10^308 = 2^(2^10) | 48.5 | 192 | 2 | 11 | 161 | "Triple-double" based on IEEE binary64 (my "f161" [39])
1.8×10^308 = 2^(2^10) | 67.2 | 256 | 2 | 11 | 215 | "Quad-double" [41] based on IEEE binary64 (my "f215" [39])
1.27×10^322 = 2^1070 | ? | ? | ? | ? | ? | CDC 6x00, 7x00, Cyber [1]
9.99×10^384 = 10^(3×2^7+1) | 16.0 | 64 | 10 | 10- | 16d | IEEE 754r decimal64 [3,4]
9.99×10^499 = 10^(10^3/2) | 12.0 | ? | 10 | 3d | 12d | HP 71B [13], 85 [1] calculators
9.99×10^999 = 10^(10^3) | 12.0 | ? | 10 | 3d | 12d | Texas Instruments 85, 92 calculators
9.99×10^999 = 10^(10^3) | 14.0 | ? | 10 | 3d | 14d | Texas Instruments 89 calculator [13]
9.99×10^999 = 10^(10^3) | 17.0 | 82 | 10 | 3d | 17d | 68881 Packed Decimal Real (3 BCD digits for exponent, 17 for mantissa, and two sign bits)
1.4×10^2465 = 2^(2^13-3) | 7.2 | 38? | 2 | 14 | 24 | Cray C90 half [8]
1.4×10^2465 = 2^(2^13-3) | 14.1 | 61? | 2 | 14 | 47 | Cray C90 single [8]
1.4×10^2465 = 2^(2^13-3) | 28.8 | 110? | 2 | 14 | 96 | Cray C90 double [8]
1.1×10^2466 = 2^(2^13) | ? | ? | ? | ? | ? | Cray 1 [1]
5.9×10^4931 = 2^(2^14-1) | 34.0 | 128 | 2 | 15 | 1+112 | DEC VAX H format
1.2×10^4932 = 2^(2^14) | 19.2 | 80 | 2 | 15 | 64 | The minimum IEEE 754-1985 double extended size (Pentium; HP/Intel Itanium; Motorola 68040, 68881, 88110)
1.2×10^4932 = 2^(2^14) | 34.0 | 128 | 2 | 15 | 1+112 | IEEE 754-2008 binary128, aka "quad" or "quadruple" [2,3,28,29,37] (DEC Alpha [9]; IBM S/390 G5 [10])
9.99×10^6144 = 10^(3×2^11+1) | 34.0 | 128 | 10 | 14- | 34d | IEEE 754-2008 decimal128 [3,4,28,29,37,38]
5.2×10^9824 = 2^(2^15-131) | 16.0 | ? | 2 | 16 | 47 | PRIME 50 [16]
1.9×10^29603 = 8^(2^15+12) | ? | ? | 8 | 16 | ? | Burroughs 6700, 7700 double [1,16]
1.6×10^78913 = 2^(2^18) | 71.3 | 256 | 2 | 19 | 1+236 | IEEE 754-2008 binary256
4.3×10^2525222 = 2^(2^23) | ? | ? | 2 | 24 | ? | PARI/GP (older 32-bit versions)
2.05×10^161614248 = 2^(2^29) | ? | ? | 2 | 30 | ? | PARI/GP (newer 32-bit versions)
1.23×10^323228458 = 2^(2^30-128) | ? | ? | 2 | 31 | ? | Mathematica® (some versions)
2.1×10^323228496 = 2^(2^30-1) | V | V | 2 | 32 | V | GNU MPFR (32-bit)
1.92×10^646456887 = 2^(2^31-352) | ? | ? | 2 | 32 | ? | Mathematica® (other versions)
10^2147483646 = 10^(2^31-2) | ? | ? | ? | ? | ? | Maple® (32-bit)
10^(6.94×10^17) = 2^(2^61) | ? | ? | 2 | 62 | ? | PARI/GP (64-bit)
10^(1.39×10^18) = 2^(2^62-1) | V | V | 2 | 64 | V | GNU MPFR (64-bit)
10^(9.22×10^18) = 10^(2^63) | ? | ? | ? | ? | ? | Maple® (64-bit)
10^(10^10000) or more | V | V | 2 | V | V | Maxima
10↑↑10 or more | - | - | - | - | 19d | WolframAlpha
10↑↑(10^10) | 300.0 | - | 10 | - | 300d | Hypercalc

Legend:

B : Base of exponent. This is the factor by which your floating-point number is multiplied if you raise its exponent by 1. Modern formats like IEEE 754 all use base 2, so B is 2, and increasing the exponent field by 1 amounts to multiplying the number by 2. Older formats used base 8, 10 or 16.

V : The size of this field, and therefore the precision, number of "digits", and/or exponent range, is variable and limited only by available memory and how long you're willing to wait for a calculation. For example, one version of Maxima takes over 4 hours to give an answer for "bfloat(10.0)^(bfloat(2.0)^bfloat(100000.0));".

We : Width of exponent. If B is 2, 8 or 16, this is the number of bits (binary digits) in the exponent field. For the specific case of B=2, We is equal to K+1 in the inequality 1-2^K < e < 2^K specifying the bounds of the excess-(2^K-1) exponent in an IEEE 754 representation (see below). When B is 10, there are two cases: "6d" indicates an exponent stored as base-10 digits, and the letter d is included to make this clear; "8-" indicates an IEEE binary decimal format, using 2 bits in the combination field and 6 bits in the following exponent field, which together can hold only 3/4 of the values such a width would imply (because the high 2 bits cannot both be 1); thus the legal values are e such that 0≤e<3×2^6.

Wm : Width of mantissa (or "fraction"). For binary formats with "hidden" or "implicit" leading mantissa bits, this is given as "1+N", such as "1+23"; the "1+" refers to the implicit leading 1 bit, which together with the 23 actually-stored bits gives a total of 24 bits of precision. For decimal formats the letter "d" is shown to make it clear that the precision is in decimal digits.

IEEE 754 Single Representation

This is worth describing in a bit more detail because it is so prevalent in the hardware used today, and it is probably what you'll be looking at when you try to decipher a floating-point value from its "raw binary".

First a warning: Although the "normal" values are what you see when your program is working with real data, proper handling of the rest of the values (denorms, NANs, etc.) is vitally important; otherwise you'll get all sorts of horrible results that are difficult to understand, and usually impossible to fix.

So, for the normal values (which in this case means, not including the zeros, denorms, NANs, and infinities) the value being represented can be expressed in the following form:

value = s × 2^(k+1-N) × n

where the sign s is -1 or 1, and k and n are integers that fall within the ranges given by:

2-2^K ≤ k ≤ 2^K-1      and      2^(N-1)-1 < n < 2^N

for two integers K and N. If you look at the range of k and n you can see that k can have exactly 2^(K+1)-2 values and n can have exactly 2^(N-1) values, and therefore exactly K+1 bits can be used to store the exponent (including two unused values discussed below) and N-1 bits to store the mantissa. To give a specific example, for IEEE 754 single precision, as the above table shows there are We=8 bits for the exponent and Wm=23 bits for the mantissa, so K is 7 and N is 24.

The exponent is stored in "excess 2^K-1" format, which means the binary value you see is 2^K-1 bigger than the actual value of k being represented. For example, when K is 7 and the stored value 254 is seen, k is 127, and the value being represented is s × 2^(128-N) × n. This is only true for the normal values just described, not for denorms.

The next set of values to understand are the denormalized values (or "denorms"), very small values for which

k = 2-2^K      and      0 < n < 2^(N-1)

using the same definitions as above. These values use one of the "unused" exponent values, namely the one that is all 0 bits. They are very important because they make underflow work better: instead of jumping suddenly to 0, you lose precision gradually as you go towards 0.

In addition to making the underflow case a little less severe by losing precision gradually instead of suddenly, denormalized values eliminate a lot of strange bugs that would otherwise occur. For example, the tests "if  x>y" and "if  x-y>0" can yield different results, unless you use denorms.
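To see gradual underflow doing its job, here is a quick sketch in Python (whose floats are IEEE binary64, so the same reasoning applies with K=10 and N=53):

```python
# Gradual underflow in action: the difference of two distinct tiny
# normal numbers is a denormal, so "x > y" and "x - y > 0" agree.
# With abrupt (flush-to-zero) underflow, x - y would become 0 even
# though x > y.
import sys

x = 2.0**-1022 * (1.0 + 2.0**-20)   # a small normal number
y = 2.0**-1022                      # the smallest normal number
d = x - y                           # 2^-1042, representable only as a denormal
print(x > y)                        # True
print(d > 0)                        # True, thanks to denorms
print(d < sys.float_info.min)       # True: d is below the smallest normal
```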

All of the various values are arranged in such a way that hardware or software can perform comparisons treating the data as signed-magnitude integers, and as long as neither argument is a NAN the proper answer will result. Such comparisons even properly handle the infinities and negative zero. (A signed-magnitude integer is a sign bit followed by an unsigned expression of its magnitude — this is not the normal signed integer format, which is called "2's complement signed integer". As with floats, there are two ways to express 0 as a signed-magnitude integer.)
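Putting the formulas together, here is a small Python sketch (my own illustration, not production code) that decodes a raw 32-bit pattern using exactly the s, k, n, K, N quantities defined above, checks it against the machine's native decoding, and demonstrates the signed-magnitude comparison trick:

```python
import struct

K, N = 7, 24   # IEEE 754 single: K+1 = 8 exponent bits, N-1 = 23 stored mantissa bits

def decode_single(bits):
    """Decode a raw 32-bit pattern as value = s * 2^(k+1-N) * n."""
    s = -1.0 if bits >> 31 else 1.0
    e = (bits >> 23) & 0xFF              # stored excess-(2^K - 1) exponent
    f = bits & 0x7FFFFF                  # stored mantissa bits
    if e == 2**(K+1) - 1:                # all ones: infinities and NANs
        return s * float('inf') if f == 0 else float('nan')
    if e == 0:                           # zeros and denorms
        k, n = 2 - 2**K, f
    else:                                # normals: prepend the hidden 1 bit
        k, n = e - (2**K - 1), (1 << (N - 1)) + f
    return s * 2.0**(k + 1 - N) * n

def f2b(x):                              # raw bits of a float, via the machine
    return struct.unpack('<I', struct.pack('<f', x))[0]

# Spot-check against the machine's own decoding
# (1.0, pi, smallest denorm, largest finite, -0):
for p in (0x3F800000, 0x40490FDB, 0x00000001, 0x7F7FFFFF, 0x80000000):
    assert decode_single(p) == struct.unpack('<f', struct.pack('<I', p))[0]

# The comparison trick: read the same bits as a signed-magnitude integer.
def sm(bits):
    return -(bits & 0x7FFFFFFF) if bits >> 31 else bits

assert sm(f2b(-2.0)) < sm(f2b(-0.5)) < sm(f2b(-0.0)) == sm(f2b(0.0)) < sm(f2b(0.5))
```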

Here are some sample values with their binary representation. The binary digits are broken into groups of 4 to help with interpreting a value in hexadecimal. They are shown in order from largest to smallest, with the non-numbers in the places they would fall if they were sorted by their bit patterns.

s exponent mantissa value(s)
0 111.1111.1 111.1111.1111.1111.1111.1111 Quiet NANs
0 111.1111.1 100.0000.0000.0000.0000.0000 Indeterminate
0 111.1111.1 0xx.xxxx.xxxx.xxxx.xxxx.xxxx Signaling NANs
0 111.1111.1 000.0000.0000.0000.0000.0000 Infinity
0 111.1111.0 111.1111.1111.1111.1111.1111 3.402×10^38
0 100.0000.1 000.0000.0000.0000.0000.0000 4.0
0 100.0000.0 100.0000.0000.0000.0000.0000 3.0
0 100.0000.0 000.0000.0000.0000.0000.0000 2.0
0 011.1111.1 000.0000.0000.0000.0000.0000 1.0
0 011.1111.0 000.0000.0000.0000.0000.0000 0.5
0 000.0000.1 000.0000.0000.0000.0000.0000 1.175×10^-38 (Smallest normalized value)
0 000.0000.0 111.1111.1111.1111.1111.1111 1.175×10^-38 (Largest denormalized value)
0 000.0000.0 000.0000.0000.0000.0000.0001 1.401×10^-45 (Smallest denormalized value)
0 000.0000.0 000.0000.0000.0000.0000.0000 0
1 000.0000.0 000.0000.0000.0000.0000.0000 -0
1 000.0000.0 000.0000.0000.0000.0000.0001 -1.401×10^-45 (Smallest denormalized value)
1 000.0000.0 111.1111.1111.1111.1111.1111 -1.175×10^-38 (Largest denormalized value)
1 000.0000.1 000.0000.0000.0000.0000.0000 -1.175×10^-38 (Smallest normalized value)
1 011.1111.0 000.0000.0000.0000.0000.0000 -0.5
1 011.1111.1 000.0000.0000.0000.0000.0000 -1.0
1 100.0000.0 000.0000.0000.0000.0000.0000 -2.0
1 100.0000.0 100.0000.0000.0000.0000.0000 -3.0
1 100.0000.1 000.0000.0000.0000.0000.0000 -4.0
1 111.1111.0 111.1111.1111.1111.1111.1111 -3.402×10^38
1 111.1111.1 000.0000.0000.0000.0000.0000 Negative infinity
1 111.1111.1 0xx.xxxx.xxxx.xxxx.xxxx.xxxx Signaling NANs
1 111.1111.1 100.0000.0000.0000.0000.0000 Indeterminate
1 111.1111.1 111.1111.1111.1111.1111.1111 Quiet NANs

IEEE 754-2008 Decimal Formats

The decimal32, decimal64 and decimal128 formats defined in the IEEE 754-2008 standard are interesting largely because of their innovative packing of 3 decimal digits into 10 binary digits. Decimal formats are still useful because they can store decimal fractions (like 0.01) exactly. Normal BCD (binary-coded decimal) uses 4 binary digits for each decimal digit, wasting about 17% of the information capacity of the bits. The 1000 combinations of 3 decimal digits fit nearly perfectly into the 1024 combinations of 10 binary digits. In addition to the space efficiency, groups of 3 work well for formatting and printing, which typically use a thousands separator (such as "," or a blank space) between groups of 3 digits. However, prospects for easy encoding and decoding might seem bleak. In 1975 Chen and Ho published the first such system, but it had some drawbacks. The Cowlishaw encoding [4], used in IEEE 754-2008, is remarkable because it manages to achieve several desirable goals at once (see [4] for the full list).
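As a rough sketch of the space arithmetic (this uses the naive "binary integer per 3-digit group" packing, not Cowlishaw's actual encoding):

```python
import math

# Information content: each decimal digit needs log2(10) ~ 3.32 bits.
print(math.log2(10))                 # 3.3219...
print(1 - math.log2(10) / 4)         # BCD: ~17% of each 4-bit group is wasted

# Three digits fit in 10 bits because 1000 <= 1024; even naive packing
# (store the 3-digit group as a plain binary integer) wastes only ~0.34%.
# Cowlishaw's densely-packed-decimal encoding is the same size but with
# much simpler encode/decode logic than a binary multiply/divide.
def pack3(a, b, c):                  # naive declet packing, NOT DPD
    return (a * 10 + b) * 10 + c     # 0..999, fits in 10 bits

print(pack3(9, 9, 9))                # 999
print(1 - math.log2(1000) / 10)      # 0.0034...
```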


Minifloats and Microfloats: Excessively Small Floating-Point Formats

Although they do not have much practical value as a universal format for computation, very small floating-point formats are of interest for other reasons.

One can refer to a format using 16 bits or less as a minifloat. (For the origin of the term, see footnotes [42,43,44].) Of these, the most popular by far is 1.5.10 (or s10e5 or binary16), the 16-bit format invented at nVidia and ILM and now a part of IEEE 754-2008. This format uses 1 sign bit, a 5-bit excess-15 exponent, 10 mantissa bits (with an implicit 1 bit) and all the standard IEEE rules including denormals, infinities and NaNs. The minimum and maximum representable positive values are 5.96×10^-8 and 65504 respectively.

s expon. mantissa value(s)
0 111.11 xx.xxxx.xxxx various NANs
0 111.11 00.0000.0000 Infinity
0 111.10 11.1111.1111 65504 (Largest finite value)
0 100.11 10.1100.0000 27.0
0 100.01 11.0000.0000 7.0
0 100.00 10.0000.0000 3.0
0 011.11 00.0000.0000 1.0
0 011.10 00.0000.0000 0.5
0 000.01 00.0000.0000 6.104×10^-5 (Smallest normalized value)
0 000.00 11.1111.1111 6.098×10^-5 (Largest denormalized value)
0 000.00 00.0000.0001 5.96×10^-8 (Smallest denormalized value)
0 000.00 00.0000.0000 0
1 011.11 00.0000.0000 -1.0 (other negative values are analogous)
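These rows are easy to check in Python, which (since version 3.6) can decode IEEE binary16 directly via the struct module's 'e' format:

```python
import struct

# Decode a raw 16-bit pattern as IEEE 754 binary16, to spot-check
# rows of the table above.
def half(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

print(half(0x7BFF))   # 65504.0   (largest finite value)
print(half(0x3C00))   # 1.0
print(half(0x0400))   # 6.104e-05 (smallest normalized value)
print(half(0x0001))   # 5.96e-08  (smallest denormalized value)
print(half(0xBC00))   # -1.0
```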

This format is supported in hardware by many nVidia graphics cards including the GeForce FX and Quadro FX 3D lines (they call it fp16 or s10e5), and is used by Industrial Light and Magic (as part of their OpenEXR standard) and Pixar as the native format for raw rendered output frames (prior to conversion to a compressed format like DVD or HDTV, or imaging on photographic film for exhibition in a theater). s10e5 is more than sufficient to represent light levels in a rendered image, and compared to 32-bit floating-point it presents quite a few advantages: it requires half as much memory space (and bus bandwidth), and an operation (such as addition or multiplication) takes less than half the time (as measured in gate delay) and uses about 1/4 as many transistors. All of these advantages are very important when you are expected to perform trillions of operations to render a frame.

To give a concrete example: at the time of the 3-GHz Pentium, which was capable of 12 billion floating-point operations per second (12 GFLOPs), nVidia graphics cards for consumers could manage around 40 billion operations per second. Soon after that, ATI (using its 24-bit 1.7.16 or s16e7 format) surpassed that figure, and the two companies repeatedly leapfrogged each other. In subsequent years the graphics cards continued to widen their lead over CPUs, and even after 32-bit floating-point became common on graphics cards, 16-bit remained in very common use, typically because it puts a smaller load on memory bandwidth.

The computer-graphics industry has long recognized the value of floating-point to represent pixels, because a pixel expresses (essentially) a light level. Light levels can vary over a very wide range — for example, the ratio between broad daylight and a clear night under a full moon is 14 "magnitudes" on the scale used by astronomers. That's 2.512^14 ≈ 400,000. The ratio of brightnesses in nighttime environments with bright lights (such as when driving at night, or in a candlelit room) is similar. Such scenes have "high-contrast" lighting. The human eye can handle this range easily. A standard 8-bit format for pixel values (typically 8 bits for each of the three components red, green and blue) doesn't even come close. Doubling the pixel width to 16 bits produces the 48-bit format (common in the industry) but does little to improve the situation for high-contrast lighting — for pixel values near the bottom of the range, roundoff error is terrible. But using the 1.5.10 float format increases the range to over 10^9 (values as small as 6.1×10^-5 and as large as 65504), with the equivalent of 3 decimal digits of precision over the entire range. It can also represent any integer from -2048 to 2048, so it is even useful in some situations like pixel addresses within a texture.

A software implementation of s10e5 is here: s10e5 C++ class.

Microfloats

A floating-point format using 8 bits or less fits in a byte; I call this a microfloat. These are the best for learning, particularly when you have to convert to/from floating-point using pencil and paper. I am not alone in thinking they are useful as an educational tool for learning about and practising the implementation of floating-point algorithms — I have found courses at no fewer than 11 colleges and universities that use them in lectures.21,22,23

But surprisingly, such small representations even have use in the real world — sort of. Some encodings used for waveforms and other time-variable analog data are very close to being a floating-point encoding with a small number of exponent and mantissa bits. An example is "mu-law" coding used for audio. Such codes usually store the logarithm of a value plus a sign, and have a special value for zero. This is not the same as a true floating-point format, but it has a similar range and precision.

The smallest format that has all three fields would be 1.1.1 format — using 3 bits with one bit each for sign, exponent and mantissa. 1.1.1 format encodes the values {-3, -2, -1, -0, 0, 1, 2, 3} or an equivalent set multiplied by a scaling factor. But this isn't very "useful" because you can do a little better just by treating the 3 bits as a signed integer (which gives you the integers -4 through 3).

The smallest formats that are "useful" in the sense of covering a broader range than the same number of bits as a signed integer have at least a 2-bit exponent field. There is always at least 1 mantissa bit anyway (the hidden leading 1, or leading 0 for the denormalized values when the exponent field is 0). The smallest of these is 1.2.0 format — three bits, encoding the values {-4, -2, -1, -0, 0, 1, 2, 4}.

Adding one mantissa bit to get the 4-bit format 1.2.1 gives us a lot more — it encodes the set {-12, -8, -6, -4, -3, -2, -1, -0, 0, 1, 2, 3, 4, 6, 8, 12} giving quite a bit more than the range of the 4-bit signed integer {-8 ... 7}.

5 bits are best used in a 1.2.2 format, using 1 sign bit, 2 exponent bits and 2 mantissa bits (plus an implicit leading 1 bit for a mantissa precision of 3 bits). If the exponent is treated as excess -2 (that's "excess minus-two"), all representable values are integers and the range is {-28 .. 28} (or {-24 .. 24} if the highest exponent value is used for infinities). 5 bits as a normal two's complement integer has a range of {-16 .. 15}.
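Here is a sketch that reproduces the value sets quoted above by brute force. The bias bookkeeping is my own (denormals are taken to coincide with the plain integers, as in the all-integer formats discussed below); only the nonnegative half of each set is generated, since the sign bit just mirrors it:

```python
# Enumerate the nonnegative values of a tiny s.e.m float: denorms
# (exponent field 0) hold the plain integer m; each higher exponent
# value doubles the scale, with the hidden leading 1 bit prepended.
def values(e_bits, m_bits):
    vals = set()
    for f in range(2**e_bits):
        for m in range(2**m_bits):
            if f == 0:
                vals.add(m)                            # denormalized
            else:
                vals.add((2**m_bits + m) * 2**(f - 1)) # normalized
    return sorted(vals)

print(values(1, 1))  # 1.1.1 -> [0, 1, 2, 3]
print(values(2, 0))  # 1.2.0 -> [0, 1, 2, 4]
print(values(2, 1))  # 1.2.1 -> [0, 1, 2, 3, 4, 6, 8, 12]
print(values(2, 2))  # 1.2.2 -> [0, 1, ..., 7, 8, 10, 12, 14, 16, 20, 24, 28]
```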

Reader George Spelvin34 pointed out that an "all-integer" 0.2.3 format, with no sign and without denormals or infinities, is used in the command for setting the keyboard repeat rate (the "typematic rate") of an IBM PC keyboard. There is a 5-bit field whose 32 possible values are used for the numbers 8 through 120, as follows:

bits    value     bits    value     bits    value     bits    value
00.000      8     01.000     16     10.000     32     11.000     64
00.001      9     01.001     18     10.001     36     11.001     72
00.010     10     01.010     20     10.010     40     11.010     80
00.011     11     01.011     22     10.011     44     11.011     88
00.100     12     01.100     24     10.100     48     11.100     96
00.101     13     01.101     26     10.101     52     11.101    104
00.110     14     01.110     28     10.110     56     11.110    112
00.111     15     01.111     30     10.111     60     11.111    120

These are used to indicate inter-character delay values of 8/240 through 120/240 of a second (i.e. the fastest rate is 30 characters per second and the slowest is 2 per second). This is like a 2-bit "excess -3" exponent and a 3-bit mantissa.
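In other words, a field with exponent bits e and mantissa bits m encodes the integer (8+m)×2^e, and a couple of lines of Python regenerate the whole table:

```python
# The typematic parameter as a 0.2.3 float: 2-bit exponent e and
# 3-bit mantissa m encode the integer (8 + m) * 2^e.
for e in range(4):
    print([(8 + m) << e for m in range(8)])
# [8, 9, 10, 11, 12, 13, 14, 15]
# [16, 18, 20, 22, 24, 26, 28, 30]
# [32, 36, 40, 44, 48, 52, 56, 60]
# [64, 72, 80, 88, 96, 104, 112, 120]
# Delay in seconds is value/240; e.g. 8/240 s -> 30 characters per second.
```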

In separate, earlier correspondence, Spelvin suggested other similar all-integer formats, with denormals but without infinities or NANs. The exponent excess is taken to be whatever value causes the denorms to use the same storage format as the corresponding integer.

Using 0.5.3 format as an example: There is no sign bit, so all values are positive. When the exponent field is 0, the mantissa is denormalized, so the values (in binary) 00000.000 through 00000.111 express the integers 0 through 7. The next exponent value is 00001 in binary; its 1 bit happens to coincide with the implicit leading 1 of the (now normalized) mantissa, so the values 00001.000 through 00001.111 express the integers 8 through 15. Notice how all of these values for the integers 0 through 15 are the same as the normal 8-bit integer representation.

After that, values scale in the normal way: 00010.000 through 00010.111 express the even integers 16 through 30 (note that only the first of these corresponds to the integer representation); 00011.000 through 00011.111 are the multiples of 4 from 32 through 60; and so on. The highest value is 11111.111, which is 15×2^30 = 2^(2^5-2)×(2^(3+1)-1) = 16106127360. Another similar format is 0.4.4, excess -4, which expresses integers from 0 up to 2^(2^4-2)×(2^(4+1)-1) = 507904.

In general, using E exponent bits and M mantissa bits, you can express all integers from 0 to 2^(M+1), and various higher values up to 2^(2^E-2)×(2^(M+1)-1).
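A quick brute-force check of this closed form against enumeration, for the two formats above (the helper function is mine):

```python
# Verify the closed form for unsigned all-integer formats with E
# exponent bits and M mantissa bits, by enumerating every bit pattern.
def all_integer_values(E, M):
    vals = set()
    for f in range(2**E):
        for m in range(2**M):
            vals.add(m if f == 0 else (2**M + m) * 2**(f - 1))
    return vals

for E, M in ((5, 3), (4, 4)):                    # 0.5.3 and 0.4.4
    vals = all_integer_values(E, M)
    assert set(range(2**(M+1) + 1)) <= vals      # all integers 0 .. 2^(M+1)
    assert max(vals) == 2**(2**E - 2) * (2**(M+1) - 1)

print("0.5.3 max:", 2**(2**5 - 2) * (2**4 - 1))  # 16106127360
print("0.4.4 max:", 2**(2**4 - 2) * (2**5 - 1))  # 507904
```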

Here is a table presenting most of the smaller entries from the main table in a somewhat different format, along with the integer-only formats that bias the exponent so that the smallest denorm is 1.

s.e.m | excess | range | comments
1.1.1 | 0 | 1 to 3 | Less range than a signed-magnitude integer
1.2.0 | 0 | 1 to 4 | The smallest format whose range exceeds that of the same number of bits interpreted as a signed-magnitude integer
1.2.1 | 0 | 1 to 12 | Best use of 4 bits
1.2.2 | -2 | 1 to 28 | Using no infinity values; range is 1 to 24 if the biggest values are used for infinities
0.2.3 | -3 | 8 to 120 | IBM PC typematic parameter [34]
1.3.2 | 3 | 0.0625 to 14 | Used in university courses [21,22]
0.4.4 | -4 | 1 to 507904 | George Spelvin [34]
0.5.3 | -8 | 1 to 16106127360 | George Spelvin [34]
1.4.3 | -3 | 1 to 229376 | About the best compromise for a 1-byte format
1.4.3 | 7 | 0.002 to 240 | Used in university courses [21,22]
1.4.7 | -7 | 1 to 4161536 | One option for 12 bits
1.5.2 | 15 | 0.000015 to 57344 | Proposed for deep learning [48]
1.5.6 | -6 | 1 to 1.35×10^11 | Another option for 12 bits
1.5.10 | -10 | 1 to 2.20×10^12 | Largest unbalanced format; range exceeds 32-bit unsigned
1.5.10 | 15 | 5.96×10^-8 to 65504 | IEEE 754-2008 binary16, also called "half", "s10e5", "fp16". 2-byte, excess-15 exponent [24,25,27,33]. Can approximate any 16-bit unsigned integer or its reciprocal to 3 decimal places.
1.6.9 | 31 | 1/M to 4290772992 | Proposed for deep learning [48]
1.8.7 | 127 | 1/M to 3.39×10^38 | bfloat16 (Intel Nervana and Cooper Lake, Google TPUs, ARMv8.6-A)
1.5.12 | 15 | 1/M to 2.81×10^14 | A fairly decent radix-8 format in an 18-bit PDP-10 halfword
1.8.10 | 127 | 1/M to 3.40×10^38 | NVidia "TensorFloat"
1.7.16 | 63 | 1/M to 9.22×10^18 | 3-byte excess-63 [17], aka "fp24", "s16e7"


Footnotes

1 : http://http.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf W. Kahan, "Why do we need a floating-point arithmetic standard?", 1981

2 : http://http.cs.berkeley.edu/~wkahan/ieee754status/Names.pdf W. Kahan, "Names for Standardized Floating-Point Formats", 2002 (work in progress)

3 : http://754r.ucbtest.org/ "Some Proposals for Revising ANSI/IEEE Std 754-1985"

4 : http://www2.hursley.ibm.com/decimal/DPDecimal.html "A Summary of Densely Packed Decimal encoding" (web page)

5 : http://www3.sk.sympatico.ca/jbayko/cpu1.html

6 : http://twins.pmf.ukim.edu.mk/predava/DSM/procesor/float.htm

7 : http://www.cc.gatech.edu/gvu/people/randy.carpenter/folklore/v5n2.html

8 : http://www.usm.uni-muenchen.de/people/puls/f77to90/cray.html

9 : http://www.usm.uni-muenchen.de/people/puls/f77to90/alpha.html

10 : http://www.research.ibm.com/journal/rd/435/schwarz.html and http://www.research.ibm.com/journal/rd/435/schwa1.gif

11 : http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html

12 : One source gave 8^31 as the range for the Burroughs B5500 (I forgot to save my source for this). I have sources for other Burroughs systems, giving 8^76 as the highest value (and 8^-50 as the lowest, for a field width of 7 bits). I might have inferred it from http://www.cs.science.cmu.ac.th/panutson/433.htm which only gives a field width of 6 bits, and no bias. The Burroughs 5000 manual says the mantissa is 39 bits, but does not talk about exponent range. Did some models have a 6-bit exponent field? Since these are the folks who simplified things by storing all integers as floating-point numbers with an exponent of 0 [17], I suspect anything is possible.

13 : http://www2.hursley.ibm.com/decimal/IEEE-cowlishaw-arith16.pdf Michael F. Cowlishaw, "Decimal Floating-Point: Algorism for Computers", Proceedings of the 16th IEEE Symposium on Computer Arithmetic, 2003; ISSN 1063-6889/03

14 : http://research.microsoft.com/users/GBell/Computer_Structures_Principles_and_Examples/csp0146.htm D. Siewiorek, C. Gordon Bell and Allen Newell, "Computer Structures: Principles and Examples", 1982, p. 130

15 : http://research.microsoft.com/~gbell/Computer_Structures__Readings_and_Examples/00000612.htm C. Gordon Bell and Allen Newell, "Computer Structures: Readings and Examples", 1971, p. 592.

16 : http://www.netlib.org/slap/slapqc.tgz FORTRAN-90 implementation of a linear algebra package, including a file (mach.f) which curiously begins with a table of machine floating-point register parameters for lots of old mainframes. See also the NETLIB D1MACH function, which gives similar values for many systems. (Formerly at http://www.csit.fsu.edu/~burkardt/f_src/slap/slap.f90 and http://interval.louisiana.edu/pub/interval_math/Fortran_90_software/d1i1mach.for respectively.)

17 : http://grouper.ieee.org/groups/754/meeting-minutes/02-04-18.html Includes this brief description of the key design feature of the Burroughs B5500: "ints and floats with the same value have the same strings in registers and memory. The octal point at the right, zero exponent." This shows why the exponent range is quoted as 8^-50 (or 8^-51) to 8^76: The exponent ranged from 8^-63 to 8^63, and the (for floating-point, always normalized) 13-digit mantissa held any value from 8^12 up to nearly 8^13, shifting both ends of the range up by that amount.

18 : http://www.inwap.com/pdp10/hbaker/pdp-10/Floating-Point.html

19 : http://nssdc.gsfc.nasa.gov/nssdc/formats/PDP-11.htm

20 : This format would be easy to implement on an 8-bit microprocessor. It has the sign and exponent in one byte, and a 16-bit mantissa with an explicit leading 1 bit (if the leading 1 were hidden/implicit, we would get a little more precision). With only 4-5 decimal digits and a range of about ±10^19, it's not too useful for financial or scientific problems — but that's what one would expect to see on the really early home computers.

21 : http://www.csc.kth.se/utbildning/kth/kurser/DD2377/maskin07/lectures/f3b-floats.pdf This lecture presentation (or a variation of it) has been used at clarkson.edu, cmu.edu, plymouth.edu, sc.edu, ucar.edu, umd.edu, umn.edu, utah.edu, utexas.edu, vancouver.wsu.edu, and www.csc.kth.se. It is a good discussion of floating-point representations, subnormals, rounding modes and various other issues. Pages 14-16 use the 1.4.3 microfloat format as an example to illustrate in a very concrete way how the subnormals, normals and NANs are related; pages 17-18 use the even smaller 1.3.2 format to show the range of representable values on a number line. Make sure to see page 30 — this alone is worth the effort of downloading and viewing the document!

22 : http://www-2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f98/H3/H3.pdf This is a homework assignment that uses the microfloat formats 1.4.3 and 1.3.2. Another similar one is here.

23 : http://www.arl.wustl.edu/~lockwood/class/cse306-s04/lecture/l11.html Lecture notes that use the 1.3.5 minifloat format for in-class examples.

24 : http://developer.nvidia.com/docs/IO/8230/D3DTutorial1_Shaders.ppt nVidia presentation describing their fp16 format (starting on slide 75).

25 : http://developer.nvidia.com/attach/6655 nVidia language specification including definition of fp16 format (page 175).

26 : http://www.cs.unc.edu/Events/Conferences/GP2/slides/hanrahan.pdf describes the nVidia GeForce 6800 and ATI Radeon 9800 graphics cards as general-purpose pipelined vector floating-point processors, and shows a rough design for a supercomputer employing 16384 of the GPU chips to achieve a theoretical throughput of 2 petaflops (2×1015 floating-point operations per second). The rackmount pictured is described here.

27 : http://www.digit-life.com/articles2/ps-precision/ This is the only source I have found that describes all of the current hardware standard formats, from IEEE binary128 all the way down to nVIDIA s10e5.

28 : Wikipedia, IEEE 754 revision. Good summary of the development of IEEE 754-2008.

29 : http://grouper.ieee.org/groups/754/revision.html Official status of the IEEE working group responsible for 754r.

30 : Steven Vickers, edited by Robin Bradbeer, "ZX Spectrum Basic Programming", 2nd edition 1983, Sinclair Research, pp 169-170. (via Lennart Benschop)

31 : http://en.wikipedia.org/wiki/Floating_point Wikipedia, Floating point (encyclopedia article). While it's possible the idea of floating-point might have been devised for use in mechanical calculators, Konrad Zuse had formulated the ideas behind his model Z3 before building the Z1, and the Z3 is generally regarded as the first generally-programmable computer (more on that topic here).

32 : http://www.epemag.com/zuse/part3c.htm Horst Zuse, The Life and Work of Konrad Zuse. The Zuse Z1 took numeric input from the operator in decimal form, and then converted it to binary. For output, binary was converted back to decimal. The input and output devices both used 5 decimal digits and an exponent ranging from 10^-8 to 10^8. However, the internal representation had 7 binary digits of exponent, so the range for intermediate calculations was somewhat larger — perhaps 2^63 or 2^64. Zuse Z3 was similar, but had 4 or 5 digits and exponent ranges of -9 to 9 (for input) and -13 to +12 (for output).

33 : http://www.gpgpu.org/s2004/slides/buck.StrategiesAndTricks.ppt Ian Buck, GPU Computation Strategies & Tricks, PowerPoint slides. Slide 4 describes the ATI and nVidia floating-point formats at the time (2004).

34 : George Spelvin, email correspondence.

35 : http://www.trs-80.com/trs80-zaps-internals.htm The TRS-80 passes 4-byte single-precision (and with Level II BASIC, 8-byte double-precision) values into and out of its ROM routines, and it is clear that one byte is an exponent. The exponent is often described as being in excess-128 (or "XS128") format. However, as reported by reader Ulrich Müller, emulators show that the range is 2^127, and that the internal representation actually uses excess-129.

36 : Joe Zbiciak, email correspondence. The TI 99/4A uses radix 100 in an 8-byte storage format. 7 bytes are base-100 mantissa "digits" (equivalent to 14 decimal digits), and the exponent (a value from -64 to 63) is stored in the 8th byte along with a sign bit. The exponent is treated as a power of 100. The largest-magnitude values are ±99.999999999999×100^63, and the smallest-magnitude values (apart from 0) are ±1×100^-64. Precision varies from just over 12 decimal digits to just under 14: for example, π/3 is 01.047197551197×100^0 and 3/π is 0.95492965855137 (represented as 95.492965855137×100^-1).

37 : Wikipedia, IEEE 754-2008.

38 : Wikipedia, Decimal128 floating-point format.

39 : Robert Munafo, F107 and F161 High-Precision Floating-Point Data Types

40 : Dekker, T. (1971) A floating-point technique for extending the available precision. In Numerische Mathematik 18, 224-242.

41 : For many more references on double-double and quad-double techniques, see the bibliography on my f107 and f161 page.

42 : Queen's University, Minifloat.java (source code), August 20, 2003. This is a confirmed source of the term "minifloat" that pre-dates my usage. The link was http://www.caslab.queensu.ca/~apsc142i/W2003/lecturenotes/Section_BCDJ/Lecture15/Minifloat.java, and this link is dead, but it can probably still be viewed at archive.org here: Minifloat.java (20030820).

43 : Robert Munafo, Survey of Floating-Point Formats, Sep 17, 2004. (Note the lack of the words "minifloat" and "microfloat".)

44 : Robert Munafo, Survey of Floating-Point Formats, Sep 22, 2004. This page uses the words minifloat and microfloat.

45 : Industrial Light & Magic (a division of Lucas Digital Ltd. LLC), s10e5 C++ class, with rounding improvements and other changes by Robert Munafo. Use is subject to the copyright and distribution conditions in its header comment. There are two other files (which I have not used): s10e5-limits.h and s10e5-function.h.

46 : John J. G. Savard ("quadibloc"), Floating-Point Formats fills in many of the details for old mainframe computers with nice color-coded diagrams for each.

47 : Vincent Lefevre, email correspondence.

48 : Naigang Wang et al., "Training Deep Neural Networks with 8-bit Floating Point Numbers", 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(This paper proposes "1.5.2" and "1.5.12" formats to be used in deep neural networks (DNN, "deep learning") hardware, but does not specify the exponent bias. The two formats are meant to be used together, specifically to cover the same range with the 16-bit format being used for intermediate operations such as summing together a lot of terms.)

Other Sources

Lennart Benschop

datapeak.net Computer History (a long timeline of events in reverse order, with many pictures)

Previous addresses of This Page

home.earthlink.net/~mrob/pub/math/floatformats.html


See also:

Alternative Number Formats

F107 and F161


Robert Munafo's home pages on AWS    © 1996-2024 Robert P. Munafo.    about    contact
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Details here.

This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2020 Jun 18.