Understanding Floating Point Numbers and Precision in the Context of Large Language Models (LLMs)

Dhananjay Kumar
10 min read · May 24, 2024


Artificial intelligence (AI) is becoming a part of every aspect of our lives, especially through large language models (LLMs) like GPT and image generation models like Stable Diffusion. As these models become more popular, there is growing concern about their computational and environmental costs, especially since their use in both consumer and enterprise settings is only going to increase.

The Importance of Numerical Representation in LLMs

The main factors driving the cost of running LLMs are the size and complexity of the models, the processors they run on, and the numerical representation of data. Modern models have been increasing in size, with their computational needs doubling every 6–10 months. Processor power has not kept pace with these growing demands, leading researchers to focus on optimizing numerical representation to reduce costs. The choice of data type affects power consumption, accuracy, and speed of a model. This is crucial during both the training and inference phases of LLMs.

Floating Point Numbers: The Basics

A floating point number is a way to represent real numbers in computing that can handle a wide range of values efficiently. The term “floating point” refers to the ability of the decimal point to “float,” or move, allowing for representation of both very large and very small numbers.

Structure of Floating Point Numbers

A floating-point number consists of three parts:

  1. Sign Bit: Indicates whether the number is positive or negative.
  2. Exponent: Determines the scale or magnitude of the number, setting the position of the decimal (or binary) point.
  3. Mantissa (Significand): Represents the significant digits of the number.

To further deepen the understanding of how floating-point numbers are represented in computers using the IEEE 754 standard, let’s dissect the example of representing the number -6.25 in a 32-bit (single precision) format. This example will illustrate the conversion process step by step, making it easier to grasp the underlying principles.

Floating Point Representation: The Curious Case of -6.25

A floating-point number in computers is broken down into three main components: the sign bit, the exponent, and the mantissa as explained above. The IEEE 754 standard outlines how these components are used to represent real numbers in binary form.

1. Sign Bit

The sign bit is the simplest component, indicating whether the number is positive or negative. It uses a single bit, where:

  • 0 represents a positive number.
  • 1 represents a negative number.

For our example, -6.25, the sign bit is 1, indicating that the number is negative.

2. Exponent

The exponent component scales the number, determining its magnitude. The process involves several steps:

Step 1: Convert to Binary

First, convert the absolute value of the number to binary. For 6.25:

  • 6 in binary is 110
  • 0.25 in binary is 0.01 (because 0.25 = 1/4 = 2^(-2))
  • Combining these, the binary representation of 6.25 is 110.01.
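
As a quick sanity check, here is a minimal Python sketch of this conversion; frac_to_binary is a hypothetical helper that extracts fractional bits by repeated multiplication by 2:

def frac_to_binary(frac, max_bits=10):
    # Hypothetical helper: emit fractional bits by repeatedly multiplying by 2.
    bits = ""
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)
        bits += str(bit)
        frac -= bit
    return bits

print(bin(6)[2:])            # Output: 110 (integer part)
print(frac_to_binary(0.25))  # Output: 01 (fractional part)
# Combined: 110.01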

Step 2: Normalize

Convert the binary number to a normalized form where there is only one non-zero digit to the left of the binary point.

What is Normalization?

Normalization is the process of adjusting the binary number so that it fits the format 1.xxxx × 2^n. This format is useful because it ensures a consistent representation of numbers in floating-point arithmetic.

How to Normalize the Binary Number:

a) Identify the Binary Point’s Initial Position:

  • The binary representation of 6.25 is 110.01.
  • The binary point is initially between the integer part (110) and the fractional part (01).

b) Move the Binary Point:

  • To normalize, we need to move the binary point so that there is only one non-zero digit to the left of it.
  • For 110.01, we move the binary point two places to the left, so it becomes 1.1001.

c) Determine the Exponent:

  • The number of places we move the binary point determines the exponent.
  • In this case, we moved the binary point two places to the left. Therefore, the exponent is 2.

Normalized Form:

  • After normalization, the binary number 110.01 becomes 1.1001.
  • The exponent is 2 because we moved the binary point two places to the left.

Thus, the normalized form is 1.1001 and the exponent is 2.
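
You can check the normalized form directly in Python: float.hex() prints a number as a hexadecimal significand times a power of two.

print((6.25).hex())  # Output: 0x1.9000000000000p+2
# 0x1.9 is 1.1001 in binary, and p+2 means the exponent is 2.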

Step 3: Apply Bias

  • IEEE 754 uses a biased exponent to accommodate both positive and negative exponents. For 32-bit floats, the bias is 127, so the actual exponent stored is 2 + 127 = 129. (Refer to the bottom section if you are curious about why the bias is 127.)

Step 4: Binary Representation

  • Finally, convert the biased exponent to binary. For 129, the binary representation is 10000001.
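
The same arithmetic in Python, formatting the biased exponent as an 8-bit binary string:

exponent = 2  # from the normalization step
bias = 127    # fp32 exponent bias
print(f"{exponent + bias:08b}")  # Output: 10000001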

3. Mantissa (Significand)

The mantissa represents the significant digits of the number, excluding the leading bit (since it’s always 1 for normalized numbers).

  • Fractional Part: After normalization, the fractional part of 1.1001 is 1001. This is what gets stored in the mantissa.
  • Filling the Mantissa: The mantissa in a 32-bit float has 23 bits. Since our fractional part 1001 only occupies 4 bits, the remaining bits are filled with zeros, resulting in 10010000000000000000000.
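
A quick way to see the stored 23-bit field is to pad the fractional bits with zeros on the right:

fraction = "1001"                   # bits after the implicit leading 1.
mantissa = fraction.ljust(23, "0")  # pad to the 23 bits available in fp32
print(mantissa)                     # Output: 10010000000000000000000
print(len(mantissa))                # Output: 23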

Combining the Components

Putting it all together, the 32-bit IEEE 754 representation of -6.25 is constructed by combining the sign bit, the exponent, and the mantissa:

  • Sign Bit: 1
  • Exponent: 10000001
  • Mantissa: 10010000000000000000000

Therefore, the complete 32-bit binary representation of -6.25 is:

Sign | Exponent | Mantissa

1 | 10000001 | 10010000000000000000000
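
We can verify this result with Python's standard struct module, which lets us pack -6.25 as a 32-bit float and inspect the raw bits:

import struct

# Pack -6.25 as a big-endian 32-bit float, then reinterpret the bytes as an unsigned int.
bits = struct.unpack(">I", struct.pack(">f", -6.25))[0]
binary = f"{bits:032b}"
print(binary[0], binary[1:9], binary[9:])
# Output: 1 10000001 10010000000000000000000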

Floating Point Precision in Python (NumPy) and PyTorch

Python's built-in float (and NumPy's default floating point dtype) is double precision (float64). This provides high precision, suitable for general-purpose programming tasks.

import numpy as np

x = 0.1
print(x)                      # Output: 0.1 (an approximation of 0.1)
print(np.array([0.1]).dtype)  # Output: float64, i.e. double precision

PyTorch, a popular framework for building LLMs, uses single precision (float32) by default for tensors. This balances precision with computational efficiency, especially on GPUs where float32 operations are faster.

import torch

x = torch.tensor([0.1])
print(x)        # Output: tensor([0.1000]) - approximation of 0.1
print(x.dtype)  # Output: torch.float32, i.e. single precision

Practical Example: Summing Floats

Floating point precision is a critical concept in numerical computing. Due to the way floating point numbers are represented in binary, some decimal numbers cannot be represented exactly, leading to small errors in calculations. Let’s consider summing 0.1 ten times in both Python (using numpy) and PyTorch to illustrate this issue.

import numpy as np

sum_python = sum([0.1 for _ in range(10)])
print(sum_python) # May not be exactly 1 due to precision issues
print(np.array([0.1]).dtype) # Output: float64
  • Double Precision (float64): Python’s float64 format uses 64 bits to store the number. This includes 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. This format can represent numbers with up to approximately 15–17 decimal digits of precision.
  • Precision Issue: The decimal number 0.1 cannot be represented exactly in binary. Instead, it is stored as an approximation. When you sum 0.1 ten times, these small errors accumulate, leading to a result that is not exactly 1.
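
To see the approximation directly, the decimal module can print the exact value that float64 actually stores for the literal 0.1:

from decimal import Decimal

# The exact binary value stored for the Python literal 0.1 (a float64):
print(Decimal(0.1))
# Output: 0.1000000000000000055511151231257827021181583404541015625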

Now let's repeat the exercise with PyTorch:

import torch

sum_pytorch = torch.tensor([0.1 for _ in range(10)]).sum()
print(sum_pytorch) # Should be close to 1 due to float32 precision
print(sum_pytorch.item()) # Get the actual value as a Python number
print(sum_pytorch.dtype) # Output: torch.float32
  • Single Precision (float32): PyTorch’s float32 format uses 32 bits to store the number. This includes 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. This format can represent numbers with up to approximately 7 decimal digits of precision.
  • Precision Issue: Similar to numpy, 0.1 cannot be represented exactly in binary in the float32 format. The precision is lower than float64, meaning the approximation error is larger. When summing 0.1 ten times, the accumulated error is still present but float32 can still produce results close to 1.
  • Of the two results, which one is closer to 1, that is, which one is more precise? The sketch below compares them.
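
Here is a small sketch comparing the two accumulated errors; the float64 error is expected to be several orders of magnitude smaller:

import torch

# Python floats are float64; PyTorch tensors default to float32.
err64 = abs(sum([0.1 for _ in range(10)]) - 1.0)
err32 = abs(torch.tensor([0.1 for _ in range(10)]).sum().item() - 1.0)
print(err64, err32)  # err64 is on the order of 1e-16, err32 around 1e-7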

Why Does This Happen?

Floating point numbers are represented in a way that can only approximate most decimal fractions. For example, the decimal fraction 0.1 has no exact binary representation. Instead, it is stored as a binary fraction that is very close to 0.1. When performing arithmetic operations, these small approximation errors can accumulate, leading to results that are not exact.

  • Binary Representation of 0.1: In binary, 0.1 is an infinite repeating fraction. Double precision (float64) and single precision (float32) store only a finite number of bits, leading to a small approximation error.
  • Accumulation of Errors: When you sum 0.1 ten times, the small approximation error in each 0.1 adds up. Generally speaking, in float64, the error should be smaller, so the result is closer to 1. In float32, the error is relatively larger, so the result deviates more from 1.

Quantization: Enhancing Efficiency in LLMs

Quantization reduces the number of bits needed to represent the weights of a network, making the model smaller, reducing computation time, and lowering power consumption. Typically, LLMs are trained using single precision 32-bit floating point (FP32) data types. However, not all 32 bits are necessary to maintain accuracy.
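
As a rough illustration (not any particular framework's API), here is a minimal sketch of symmetric per-tensor quantization of fp32 weights to INT8 and back:

import torch

# Minimal sketch of symmetric per-tensor INT8 quantization; real toolkits
# calibrate scales per channel or per layer, so treat this as illustrative only.
w = torch.randn(4, 4)                                # fp32 weights
scale = w.abs().max() / 127                          # map the largest |w| to 127
w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_dequant = w_int8.float() * scale                   # approximate reconstruction
print((w - w_dequant).abs().max())                   # worst-case quantization error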

Quantization Techniques

1. Post-Training Quantization (PTQ):

  • Applied after training a model with high precision.
  • Converts weights and activations to lower precision (e.g., FP8 or INT8).
  • Simple but can cause accuracy loss.

2. Quantization-Aware Training (QAT):

  • Incorporates quantization during training.
  • Simulates quantized operations during forward and backward passes.
  • Better accuracy but more complex.
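
QAT frameworks typically insert "fake quantization" ops that round in the forward pass while keeping tensors in float; the helper below (fake_quant is a hypothetical name, not a library function) sketches the idea:

import torch

def fake_quant(x, scale, bits=8):
    # Hypothetical helper: simulate quantization during the forward pass.
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = torch.clamp((x / scale).round(), qmin, qmax)
    return q * scale  # dequantize so downstream ops stay in floating point

x = torch.randn(3)
print(fake_quant(x, scale=x.abs().max() / 127))

In real QAT, the rounding step is paired with a straight-through estimator so gradients can flow through it during the backward pass.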

The Debate: INT8 vs. FP8

The AI industry is debating between INT8 and FP8 as preferred data types for quantized models. Each has its advantages and trade-offs.

INT8

  • Representation: An 8-bit signed integer: 1 sign bit and 7 magnitude bits, with no exponent or mantissa.
  • Advantages: Higher precision than FP8 within its range; well suited to widely available integer hardware.
  • Disadvantages: Smaller dynamic range, which may not be optimal for all models.

FP8

  • Flavors: FP8 comes in several flavors, most commonly FP8 E4M3 and FP8 E5M2 (E3M4 has also been proposed).
  • Advantages: Larger dynamic range than INT8, suitable for both training and inference.
  • Disadvantages: Precision varies across the flavors, and hardware support is still uneven.
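
If your PyTorch build exposes the experimental FP8 dtypes (available in recent releases; the exact dtype names below are an assumption about your version), you can compare the representable ranges directly:

import torch

print(torch.iinfo(torch.int8))               # min=-128, max=127
print(torch.finfo(torch.float8_e4m3fn).max)  # roughly 448
print(torch.finfo(torch.float8_e5m2).max)    # roughly 57344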

Impact on Large Language Models (LLMs)

Large language models (LLMs), such as those used in natural language processing (NLP), are often trained using single precision (float32) to balance precision and computational efficiency. For deployment and inference, models can be further reduced, or quantized, using lower precision formats such as float16, int8, or even int4.

Example of Precision and Memory Trade-off

Consider a large language model with 100 million parameters:

Float32:

  • Each parameter requires 4 bytes.
  • Total memory required = 100 million × 4 bytes = 400 million bytes (400 MB).

Float16:

  • Each parameter requires 2 bytes.
  • Total memory required = 100 million × 2 bytes = 200 million bytes (200 MB).

By reducing the precision from float32 to float16, we save 200 MB of memory. However, this comes at the cost of reduced precision, which might impact the model’s performance on certain tasks.
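
The same arithmetic can be checked against a toy PyTorch module (the layer size here is arbitrary, chosen to land near 100 million parameters):

import torch

model = torch.nn.Linear(10_000, 10_000)     # roughly 100 million parameters
n_params = sum(p.numel() for p in model.parameters())
print(n_params * 4 / 1e6, "MB in float32")  # about 400 MB
print(n_params * 2 / 1e6, "MB in float16")  # about 200 MB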

Conclusion

Understanding floating point numbers and their precision is crucial for developers, researchers and data scientists, especially when working with numerical computations in LLMs. While Python offers higher precision with float64, PyTorch opts for the more efficient float32, highlighting the trade-off between precision and performance. Being aware of these differences and their implications can help in designing more robust and accurate computational models.

For large language models, choosing the right precision is a balance between computational efficiency and model performance. With the advent of techniques like quantization, it’s possible to deploy highly efficient models without significantly compromising their capabilities. As the AI industry continues to evolve, finding the optimal numerical representation will remain a key area of research and development.

Additional Resources and Examples

If you are curious to know how -6.25 would be represented in fp64 (double precision) and fp16 (half precision), here it is:

64-bit (Double Precision) Representation of -6.25

In double precision, the IEEE 754 standard allocates 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa.

  • Sign Bit: 1 (since the number is negative)
  • Exponent: The exponent for -6.25 is 2, and with a bias of 1023 (used in double precision), the stored exponent is 2 + 1023 = 1025. The binary representation of 1025 is 10000000001.
  • Mantissa: The fractional part after normalization is 1001. In double precision, we have 52 bits for the mantissa, so after placing 1001, the rest are filled with zeros, resulting in 1001000000000000000000000000000000000000000000000000.

Combining these components, the 64-bit representation is:

Sign | Exponent | Mantissa

1 | 10000000001 | 1001000000000000000000000000000000000000000000000000
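
The struct-based check works for the 64-bit case as well:

import struct

bits = struct.unpack(">Q", struct.pack(">d", -6.25))[0]
binary = f"{bits:064b}"
print(binary[0], binary[1:12], binary[12:])
# Output: 1 10000000001 1001000000000000000000000000000000000000000000000000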

16-bit (Half Precision) Representation of -6.25

In half precision, the IEEE 754 standard allocates 1 bit for the sign, 5 bits for the exponent, and 10 bits for the mantissa.

  • Sign Bit: 1 (since the number is negative)
  • Exponent: The exponent for -6.25 is 2, and with a bias of 15 (used in half precision), the stored exponent is 2 + 15 = 17. The binary representation of 17 is 10001.
  • Mantissa: The fractional part after normalization is 1001. In half precision, we have 10 bits for the mantissa, so after placing 1001, the rest are filled with zeros, resulting in 1001000000.

Combining these components, the 16-bit representation is:

Sign | Exponent | Mantissa

1 | 10001 | 1001000000
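
The struct module also supports half precision (format code "e" on Python 3.6+), so the 16-bit case can be checked the same way:

import struct

bits = struct.unpack(">H", struct.pack(">e", -6.25))[0]
binary = f"{bits:016b}"
print(binary[0], binary[1:6], binary[6:])
# Output: 1 10001 1001000000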

Why does the IEEE 754 standard use a bias of 127 for fp32?

In general, for an exponent field with k bits, IEEE 754 uses a bias of 2^(k-1) - 1. For fp32's 8-bit exponent this gives 2^7 - 1 = 127 (similarly, 1023 for fp64's 11-bit exponent and 15 for fp16's 5-bit exponent). Those of you who want more information about bias in the different IEEE 754 floating point formats can refer to this link: 127 bias
