Deep Learning applications have highlighted the inefficiencies of the IEEE floating point format. Both Google and Microsoft have jettisoned IEEE floating point for their AI cloud services to gain two orders of magnitude better performance over their competitors. Similarly, AI workloads on mobile and embedded devices have moved away from IEEE floating point to optimize performance per Watt.

However, Deep Learning applications are hardly the only applications that expose the limitations of IEEE floating point. Cloud scale, IoT, embedded, control, and HPC applications are also limited by the inefficiencies of the format. As NVIDIA, Google, and Microsoft have demonstrated, a simple change to a new number system can improve scale and cost of these applications by orders of magnitude, and create completely new application and service domains.

When performance and/or power efficiency are differentiating attributes for an application, the complexity of IEEE floats simply can't compete with number systems that are tailored to the needs of the application. Posits are a tapered floating point format, designed to replace IEEE floating point and provide a more robust computational arithmetic for the reals. The Stillwater Universal Number library provides application developers a ready-to-use arithmetic library to incorporate this new number system in their applications. To get started, simply clone the library and follow the README.

The core limitations of IEEE floating point are caused by two key problems of the format:

- inefficient representation of the reals
- inability to reproduce results across different concurrency environments

- **Wasted Bit Patterns** - 32-bit IEEE floating point has around sixteen million ways to represent NaN (Not-A-Number), while 64-bit floating point has around nine quadrillion. A NaN is an exception value that represents undefined or invalid results, such as 0/0, so there is no reason to allocate that many encodings to it.
- **Mathematically Incorrect** - The format specifies two zeroes, a negative and a positive zero, which behave differently. The associative and distributive laws of arithmetic are lost because the result is rounded after each operation.

- **Overflows to ± inf and underflows to 0** - Overflowing to ± inf increases the relative error by an infinite factor, while underflowing to 0 loses sign information.
- **Unused dynamic range** - The dynamic range of double precision floats is a whopping 2^2047, whereas most numerical software is architected to operate around 1.0.
- **Complicated Circuitry** - Denormalized floating point numbers have a hidden bit of 0 instead of 1. This creates a host of special handling requirements that complicate compliant hardware implementations.
- **No Gradual Overflow and Fixed Accuracy** - If accuracy is defined as the number of significand bits, IEEE floating point has fixed accuracy for all numbers except denormalized numbers, because the number of significand digits is fixed. Denormalized numbers lose significand digits as the value approaches zero, a result of having a zero hidden bit. They fill the underflow gap (i.e. the gap between zero and the smallest normalized non-zero value). The counterpart of gradual underflow is gradual overflow, which does not exist in IEEE floating point.

Posits address these shortcomings:

- **Economical** - No bit patterns are redundant. There is one representation for infinity, denoted ± inf, and one for zero. All other bit patterns are valid, distinct, non-zero real numbers. ± inf serves as a replacement for NaN.
- **Mathematically Elegant** - There is only one representation for zero, and the encoding is symmetric around 1.0. Associative and distributive laws are supported through deferred rounding via the quire, enabling reproducible linear algebra algorithms in any concurrency environment.
- **Tapered Accuracy** - Tapered accuracy means that values with small exponents have more digits of accuracy and values with large exponents have fewer digits of accuracy. This concept was first introduced by Morris (1971) in his paper "Tapered Floating Point: A New Floating-Point Representation".
- **Parameterized precision and dynamic range** - Posits are defined by a size, *nbits*, and the number of exponent bits, *es*. This gives system designers the freedom to pick the right precision and dynamic range for the application. For example, for AI applications we may pick 5 or 6 bit posits without any exponent bits to improve performance. For embedded DSP applications, such as 5G base stations, we may select a 16 bit posit with one exponent bit to improve performance per Watt.
- **Simpler Circuitry** - There are only two special cases, Not a Real and Zero. There are no denormalized numbers, and no overflow or underflow.

This library is a bit-level arithmetic reference implementation of the evolving Universal Number Type III (posit and valid) standard. The library provides a faithful posit arithmetic layer for any C/C++/Python environment.

As a reference library, there is extensive test infrastructure to validate the arithmetic, and there is a host of utilities to become familiar with the internal workings of posits and valids.

- Accurate
- Fully parameterized
- Header-only C++ library
- High Productivity

As a header-only C++ template library, it is trivial to integrate into your computational software. Many software packages have gone before you, including Eigen, MTL4, G+SMO, and ODE, so you are in good company.

The library models the arithmetic at the bit-level and is the validation vehicle for our posit-enabled tensor processor hardware.

The library provides a complete set of posit configurations, ranging from the very small,
**posit<2,0>**, to the very large, **posit<256,5>**.

Just shoot us an email and we'll be glad to give you a hand with anything you need. Or just say hi!