Floating Point Precision

Floating point numbers are what computers use to approximate decimal point numbers like 45.6 or 327.4, but also potentially more interesting numbers like 3.14159265…, which would easily take the top spot on any top 10 list of important numbers for gaming graphics. And what’s the real meaning of single and double precision, and all this talk about FP16?

Decimal point numbers can be said to sit between the whole numbers (integers, natural/countable numbers) on one side, and the complex numbers on the other. They represent what’s “between the whole numbers” if you will. Number theory is a whole field of mathematics that explains this more precisely, and group theory explains what happens when we do things with the numbers, like add them together for example. For now, let’s keep it simpler.

In order to work with numbers on a computer we’ll need to be a bit clever. It’s easy to forget these days, but computers can actually only work with zeros and ones. If we want to work with any other numbers, even just simple integers, we need to represent these numbers using zeros and ones in a way that the computer can work with. For whole numbers, computers use something called binary two’s complement representation. For the decimal point numbers, completely different representations are used, that are incompatible with integers, and they are also incompatible between different precision formats. Computers can only do operations like adding between numbers of the same format, otherwise they need to be converted first.

So how would we want to represent decimal point numbers using only zeros and ones?

A simple application of decimal point numbers is an online shopping cart. You may be adding a $178.99 graphics card. We’ll quickly realize that there are 100 cents between each dollar, it can go from 178.00 up to 178.99 and then we go to 179.00. There’s really nothing in between that we’re interested in, a price of $178.994 has no particular meaning, and it’s the same for most common currencies. Historically there have been quite a few non-decimal currencies, but they are all but extinct in the modern world.

Clearly, a dead simple representation of this would just be to use two integers, one for the dollars and one for the cents. We could also just multiply the dollar amount by 100 and then add the cents, so $178.99 would be represented as 17899, and then we just remember to divide by 100 later when we want to use it. This is fixed point arithmetic, and it is actually used quite frequently when we’re only interested in a fixed number of decimals after the decimal point. Clearly we can multiply by 10 or 1000 instead and get whichever number of decimals we want.
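To make the shopping cart idea concrete, here is a minimal sketch of fixed point arithmetic in Python. The helper names (to_cents, format_price) are made up for this illustration and don’t come from any particular library.

```python
# Fixed point sketch: store money as an integer number of cents.
CENTS_PER_DOLLAR = 100  # scale factor: two decimals after the point


def to_cents(dollars: int, cents: int) -> int:
    """Pack a dollars-and-cents price into a single integer."""
    return dollars * CENTS_PER_DOLLAR + cents


def format_price(total_cents: int) -> str:
    """Undo the scaling only when the value is displayed."""
    dollars, cents = divmod(total_cents, CENTS_PER_DOLLAR)
    return f"${dollars}.{cents:02d}"


graphics_card = to_cents(178, 99)  # stored as 17899
cart_total = graphics_card * 2     # plain integer arithmetic, no rounding

print(graphics_card)             # 17899
print(format_price(cart_total))  # $357.98
```

All the arithmetic happens on integers, so no rounding error can sneak in. The scale factor only matters when we convert to and from text.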

However, if we were to represent π as 3.14 instead of 3.1415926535…, we would be creating a fairly significant error, and our circles would be less than amazing. We can do better than that.

Now, there’s a fairly significant issue that I haven’t mentioned yet. The fixed decimal point numbers above are examples of what’s called rational numbers (ℚ) in mathematics. They are all the numbers that can be expressed as a quotient of two integers, for example 3/5, or 99/100 as in our shopping cart example above. These are easy to represent in a computer: just store the two integers separately. Computers don’t actually do this in hardware because they’re doing something even more clever, but we can do this in software fairly efficiently if we need to.
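As a rough illustration of storing rational numbers as two integers in software, Python’s standard library has a Fraction type that does exactly this. The variable names below are only for the example.

```python
from fractions import Fraction

three_fifths = Fraction(3, 5)
cents = Fraction(99, 100)

# Exact rational arithmetic: 60/100 + 99/100 = 159/100
total = three_fifths + cents

print(total)                               # 159/100
print(total.numerator, total.denominator)  # 159 100
print(Fraction(1, 3) * 3 == 1)             # True, no rounding error
```

The price we pay is that the two integers can grow without bound as we keep calculating, which is one reason hardware takes a different route.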

If you’re following along so far, then hold on to your hat, because there are even more numbers between the rational numbers!

This isn’t exactly how a mathematician would express it, but conceptually there are numbers “between” the rational numbers, and these are the numbers that we’re most interested in for computer graphics. These numbers are called irrational. Some famous examples are √2, π, e, and φ. The square root of 2 is the length of the hypotenuse of a right triangle with legs of length 1, and we all know the importance of triangles in computer graphics. Pi is the ratio between a circle’s circumference and its diameter, but it shows up in all kinds of places, some very surprising and unexpected. Euler’s number e is possibly a bit less known, but is equally important, especially if we’re doing statistics and probabilities, but also in optimization and combinatorics. The golden ratio φ is well known to anyone working in art, design and architecture, as it has an interesting “pleasant” property to humans. It also shows up in nature and biology, and really in all kinds of interesting places. These numbers have a profound importance to things we humans want to be able to do, and thus we need to be able to store them on computers and work with them.

With this knowledge, here is an amazing mathematical identity that will blow your mind if you haven’t seen it before:

e^(iπ) + 1 = 0

Search for Euler’s Identity for more.
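Euler’s Identity states that e^(iπ) + 1 = 0. As a small sketch, we can ask Python to evaluate the left-hand side using its standard cmath module. Since the computer only has approximations of e and π to work with, we shouldn’t expect to land on exactly zero.

```python
import cmath
import math

# e^(iπ) + 1, evaluated with double precision complex arithmetic.
result = cmath.exp(1j * math.pi) + 1

print(result)       # very close to 0, typically not exactly 0
print(abs(result))  # size of the leftover rounding error
```

The tiny leftover is a first taste of what the rest of this article is about.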

There’s just a catch: Computers cannot store these numbers. What’s worse is that computers as we know them today will never be able to store them. It’s not a matter of technological development, it is forever impossible.

Written out, the golden ratio is 1.6180339887…

The interesting part is the … that comes last. It represents a sequence of digits that goes on forever. Forever is a very long time. Computers are finite machines, built with finite amounts of materials, and cannot store numbers with infinite precision. We couldn’t store even a single one of these irrational numbers, but there are also infinitely many of them. We need another strategy.

Instead, what we can do with these numbers is choose some level of precision. We’ll pick some approximation of the irrational numbers and live with the fact that there will be a small error. This is effectively deciding how many decimals to put after the decimal point, and in theory we can make this as large as we want, as long as what we want is finite.
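To get a feel for “choosing a level of precision”, here is a small sketch using Python’s decimal module, which lets us pick the number of significant digits in software. The helper name sqrt2_with_digits is made up for this example, and this is not how hardware floating point works internally.

```python
from decimal import Decimal, localcontext


def sqrt2_with_digits(digits: int) -> Decimal:
    """Approximate the square root of 2 to the given number of significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits  # precision = number of significant digits
        return Decimal(2).sqrt()


for digits in (3, 10, 30):
    print(digits, sqrt2_with_digits(digits))
```

We can ask for as many digits as we like, but every answer is still a finite approximation of √2.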

There is also another related issue. The finite reality of computers means not only that we are limited to a finite precision, but we’re also limited in how large the numbers we represent can be. Just as a simple example, a 32-bit unsigned integer can store values up to 4,294,967,295, or a bit over 4 billion. This is a very large number, but still not large enough if we wanted to store the number of humans on earth, for example. With 64-bit integers we increase this by a lot, to roughly 1.8 x 10^19, but we still wouldn’t be able to store the width of the observable universe in km.
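The integer limits are easy to check for ourselves. Python’s own integers grow as needed, so this sketch uses the struct module to force a value into exactly 4 bytes. The function name fits_in_uint32 and the population figure are just illustrative assumptions.

```python
import struct

UINT32_MAX = 2**32 - 1  # 4,294,967,295
UINT64_MAX = 2**64 - 1  # 18,446,744,073,709,551,615


def fits_in_uint32(value: int) -> bool:
    """Try to pack the value into exactly 4 bytes as an unsigned integer."""
    try:
        struct.pack("<I", value)
        return True
    except struct.error:
        return False


world_population = 8_000_000_000  # rough figure, for illustration only

print(fits_in_uint32(UINT32_MAX))        # True
print(fits_in_uint32(world_population))  # False
print(world_population <= UINT64_MAX)    # True
```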

Floating point numbers in computers are designed to represent much larger numbers, and with good enough precision for most applications. These floating point numbers are standardized by IEEE in the IEEE 754 standard, and are implemented in hardware on all major modern computers. When we talk about single or double precision, we’re specifically talking about how many bits of storage we use for each such number. The more bits, the larger and more precise the numbers we can store.

Precision      Name    Bits   Example
Half           FP16    16     0.123
Single         FP32    32     0.1234567
Double         FP64    64     0.123456789012345
X86 extended           80     0.1234567890123456789
Quad           FP128   128    0.1234567890123456789012345678901234
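We can see the effect of the different sizes by storing the same value in half, single and double precision and reading it back. This is a sketch using Python’s struct module, whose "e", "f" and "d" format codes correspond to 16, 32 and 64-bit floats. Python’s own float is a double, so the starting value is already rounded to double precision, and the helper name round_trip is made up for the example.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


x = 0.1234567890123456789  # already rounded to a double by Python

for name, fmt in (("half", "<e"), ("single", "<f"), ("double", "<d")):
    stored = round_trip(x, fmt)
    print(f"{name:>6}: {stored!r}  error = {abs(stored - x):.3e}")
```

The error shrinks dramatically with each step up in size, in line with the number of digits shown in the table.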

If we do a tear-down of a single precision floating point number (32 bits) and look at the internals, we see that it’s represented like this:

sign (1 bit) | exponent (8 bits) | fraction (23 bits)

Conceptually, there are two fixed point numbers, one for fraction and one for exponent, and then one bit to represent whether it’s a positive or negative number, all packed into a single storage unit. For half or double precision, these numbers are smaller or larger respectively. Computers have hardware registers that can store these numbers, and hardware instructions that can do arithmetic on them. Modern computers even have SIMD registers where several floating point numbers can be stored in one large vector, often 256 bits or more.
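For the curious, the three parts can be pulled out of a single precision number with a bit of shifting and masking. The sketch below assumes the IEEE 754 single precision layout of 1 sign bit, 8 exponent bits and 23 fraction bits. The function name float32_fields is just for illustration.

```python
import struct


def float32_fields(value: float) -> tuple[int, int, int]:
    """Split a number, stored as FP32, into (sign, exponent, fraction) bit fields."""
    # Reinterpret the 4 bytes of the float as one unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction


print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-2.5))  # (1, 128, 2097152)
```

For -2.5 the sign bit is set, the exponent field 128 means 2^(128 - 127) = 2^1, and the fraction 2097152 / 2^23 = 0.25 gives 1.25 x 2 = 2.5.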

Without getting into all the details of the exact encoding, a floating point number is conceptually similar to scientific notation, where we might write something like 1.234 x 10^56 (^ representing raised to the power of). 10 is called the base here, and while we use a base of 10 in daily use (10 fingers, how hard can counting be?), computers prefer a base of 2, so that’s what floating point numbers use. It also doesn’t need to be stored, since it’s always the same for all numbers.
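Here is a small sketch of the “scientific notation with base 2” idea, using math.frexp from Python’s standard library to split a number into a mantissa and a power of two. The helper name base2_scientific is made up for this example and it only handles positive numbers.

```python
import math


def base2_scientific(value: float) -> str:
    """Write a positive float as m x 2^e with 1 <= m < 2."""
    # frexp returns (m, e) with value == m * 2**e and 0.5 <= m < 1.
    mantissa, exponent = math.frexp(value)
    return f"{mantissa * 2} x 2^{exponent - 1}"


print(base2_scientific(10.0))  # 1.25 x 2^3
print(base2_scientific(0.75))  # 1.5 x 2^-1
```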

In the early days of floating point units, it was significantly slower to work with floating point numbers, but these days they are very fast even compared to integers. Modern processors can finish at least one floating point instruction per clock cycle, and more when using SIMD. Multiplication also used to be slower than addition and subtraction, but this isn’t really the case anymore. Division is a bit slower, but still not anywhere near as bad as it used to be. The real performance killer is integer division, which is often much slower than floating point division and something for developers to avoid like the plague.

It is also often the case that single and double precision arithmetic is just about equally fast on CPUs. This is maybe somewhat surprising, but given 64-bit computers that can operate on the full 64 bits of a double precision number in one go, it’s fairly easy to understand why this is the case. Going down to single precision 32-bit numbers still ends up using the same 64-bit execution units, they just leave half of them unused. X86 processors also have a special internal 80-bit floating point format, but half precision and quad precision numbers aren’t normally available on CPUs.

Rapid Packed Math

However, the same isn’t true for GPUs. Interestingly, on Intel Skylake and later processors, the on-chip GPU does have support for FP16, which can be used via OpenCL for example.

With Radeon Vega, AMD brought Rapid Packed Math to consumer cards, which is a way of packing two FP16 operations into a single FP32 slot. Think of it as a micro-SIMD unit that allows twice as many operations to be performed in the same time. Obviously, as we saw above, FP16 significantly limits the precision of the numbers, but for applications where this is acceptable, the performance gained is significant. Nvidia later introduced similar capability in the Turing architecture.

The key thing about FP16 numbers is obviously that they are smaller and faster. Smaller means we can store twice as many numbers in the same amount of memory, and faster means we can ideally perform twice as many operations in the same time. The drawback is that we can only store something like three decimals, and can’t use very large numbers. For some applications it may be fine though, for example in computer graphics, where things ultimately get projected onto a discrete grid of pixels and forcefully rounded, or some applications in machine learning, where we may have access to vast amounts of data, and the precision of each data point is not all that important.
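To see both FP16 limitations in practice, we can round a few values to half precision on the CPU. This is only a sketch using the "e" format of Python’s struct module to emulate the storage format; it says nothing about GPU speed. The helper name to_half is made up for the example.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(to_half(3.14159265))  # 3.140625, only about three decimals survive
print(to_half(2049.0))      # 2048.0, whole numbers above 2048 start to go missing
print(to_half(65504.0))     # 65504.0, the largest finite FP16 value

try:
    to_half(70000.0)
except (OverflowError, struct.error) as exc:
    # Python refuses to pack a value that is too large for FP16.
    print("too large for FP16:", exc)
```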

Special Properties

As we understand from the description above, floating point numbers are rounded to a certain number of significant digits (binary digits, strictly speaking). From a mathematical perspective, floating point numbers are rational numbers, i.e. they can be represented as a ratio between two integers. Floating point numbers are not real numbers, and irrational numbers must be approximated. Because of the encoding, floating point numbers can’t represent all possible values, and we can sometimes see this when converting numbers from text, for example. In text format, we can clearly write any decimal point number we want, but when converting it to a floating point number we may or may not get the exact same number.
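A classic way to see the text conversion issue is the number 0.1, which has no exact representation in base 2. The sketch below uses Python’s decimal and fractions modules to reveal which rational number a double precision float actually ends up storing.

```python
from decimal import Decimal
from fractions import Fraction

x = float("0.1")  # convert from text to a double precision float

print(Decimal(x))            # the exact stored value, slightly off from 0.1
print(x.as_integer_ratio())  # the same value as a ratio of two integers
print(Fraction(x) == Fraction(1, 10))  # False, it is not exactly 1/10
print(float("0.5") == Fraction(1, 2))  # True, 0.5 is exact in base 2
```

So the float really is a rational number, just not always the one we typed in.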

Clearly, when doing multiple operations on floating point numbers in sequence, such as in a mathematical formula, the rounding errors from each operation can accumulate. This has an impact on developers, since many mathematical formulas can be reordered and still be equivalent. We would expect A = (B + C) + D to be equivalent to A = B + (C + D), for example. With floating point numbers, the error can and will end up being different depending on the order of the operations, and thus the order of calculations will give different results! It is also possible to make the error larger or smaller by reordering the operations. In mathematical terms, floating point addition (as an example) is not associative, even though this is a property we expect from rational numbers. Many programmers are unaware of this, and it can be a source of very subtle bugs where things look perfectly right in testing, but give wrong results when used on real life data.
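The order dependence is easy to reproduce with double precision numbers in Python. The values below are just convenient examples. math.fsum is a standard library function that tracks the lost low-order bits, shown here as one possible remedy.

```python
import math

left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6

print(left == right)  # False

big, small = 1e16, 1.0
print((big + small) - big)  # 0.0, the 1.0 was rounded away
print((big - big) + small)  # 1.0
print(math.fsum([big, small, -big]))  # 1.0
```

Same numbers, same operations, different grouping, different answers.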

There are some other special oddities with floating point numbers. For example, zero is a signed number, meaning positive zero and negative zero are two distinctly different numbers. For comparisons they behave the same, but they can give different results for some other operations. Division by zero is one such operation. In IEEE floating point numbers, dividing a positive number by negative zero results in negative infinity, and dividing it by positive zero gives positive infinity.
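Signed zeros can be observed from Python as well. One caveat: Python deliberately raises an exception for float division by zero instead of returning an infinity, so this sketch uses math.copysign and math.atan2 to show that the two zeros really are different.

```python
import math

pos_zero, neg_zero = 0.0, -0.0

print(pos_zero == neg_zero)          # True, they compare as equal
print(math.copysign(1.0, pos_zero))  # 1.0
print(math.copysign(1.0, neg_zero))  # -1.0, the sign bit is still there
print(math.atan2(0.0, pos_zero))     # 0.0
print(math.atan2(0.0, neg_zero))     # 3.141592653589793

try:
    1.0 / neg_zero
except ZeroDivisionError as exc:
    # Python's choice, not IEEE's: no infinity is returned here.
    print("Python raises instead:", exc)
```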

And yes, there is a way to represent infinity in floating point numbers. Actually two infinities, one positive and one negative. So we can do things like:

Inf + 7 = Inf

Inf x -2 = -Inf

But what happens when we do:

Inf x 0 = ???

When multiplying infinity by zero, there really is nothing meaningful to do. The result is not really something we consider a number at all, and floating point numbers actually have a way of representing this too. NaN stands for not a number, and is exactly that: a way of representing a value that is not a number. In applications, this is typically not something we’re interested in, and when we come across it, it is typically an error. The floating point format is designed to behave sensibly with these kinds of errors, but in broken code they can end up escalating quickly. NaNs have the property that they stick: once your numbers become NaNs, they propagate to everything they touch. This is another common source of numerical errors in programs, and the root cause of the bug can be exceptionally hard to locate, since it also depends on the input data used.
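The infinity and NaN rules above can be tried out directly in Python, whose floats are IEEE double precision numbers. A short sketch:

```python
import math

inf = math.inf

print(inf + 7)   # inf
print(inf * -2)  # -inf

nan = inf * 0
print(nan)              # nan
print(math.isnan(nan))  # True
print(nan == nan)       # False, a NaN is not even equal to itself

# Once a NaN is in the mix, it sticks to everything it touches.
print(math.isnan((nan + 1.0) * 0.0))  # True
```

The self-inequality is worth remembering: checking a value with == against NaN never works, which is why math.isnan exists.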

There are also denormal or subnormal numbers, which are a special way of encoding smaller numbers than we otherwise could, while sacrificing some precision. In certain hardware implementations, these subnormal numbers are significantly slower than regular floating point numbers, and can be yet another source of performance bugs that can be difficult to solve.
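Subnormal numbers can be poked at from Python too. This sketch only shows the range and precision side of things for double precision, not the performance cost, which depends on the hardware.

```python
import math
import sys

smallest_normal = sys.float_info.min           # 2.2250738585072014e-308
smallest_subnormal = math.ldexp(1.0, -1074)    # 2^-1074, prints as 5e-324

# Below the smallest normal number we don't drop straight to zero...
print(smallest_normal / 2 > 0.0)  # True

# ...but precision is lost: 1.5 steps gets rounded to 2 whole steps.
print(smallest_subnormal * 1.5 == smallest_subnormal * 2)  # True

# And eventually we do run out of numbers.
print(smallest_subnormal / 2)  # 0.0
```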


Floating point numbers are far more complicated than most people know. We’d like to treat them as real numbers, but they’re not. They are rational numbers, but they don’t preserve the usual properties of operations on rational numbers. Rounding errors accumulate, and the order of operations can determine which result we get. Floating point performance varies wildly between implementations and generations of processors. It also depends on the internal architecture of the processor, where using higher or lower precision numbers sometimes has a performance impact, and sometimes doesn’t. Being computers, it’s of course all deterministic, but the nature of floating point numbers and numerical calculations on computers is extremely subtle. Bugs can be introduced in one part of the program, but only show up in another. They can depend on the input data, so things look fine in testing, but break when released to the customer. And yet, floating point numbers as used today are nothing less than a technological miracle, and represent our best effort so far at dealing with a very difficult problem.
