Signed floating point numbers


Were present real numbers in our daily life is not convenient for representing very small numbers, like +0.00000012347650. This same number can be more conveniently represented in scientific notation as +1.23476× 10−07. But this actually stands for +0.000000123476. So there is an error of 0.00000000000005, which forms a very small percentage error.

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

  • A signed (meaning positive or negative) digit string of a given length in a given base(or radix).This digit string is referred to as the significand, mantissa, or coefficient.

  • A signed integer exponent which modifies the magnitude of the number.

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there is an infinite number of real numbers (even within a small range of says 0.0 to 0.1). On the other hand, a n-bit binary pattern can represent a finite 2n distinct numbers. Hence, not all the real numbers can be represented. Floating number arithmetic is very much less efficient than integer arithmetic. Hence, it is better to use integers if an application does not require floating-point numbers. The nearest approximation will be used instead, resulted in the loss of accuracy. Modern computers adopt IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

IEEE floating point numbers have three basic components: the sign, the exponent, and the mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored.

The following table shows the layout for the single (32-bit) and double(64-bit) precision floating-point values. The number of bits for each field is shown, followed by the bit ranges in square brackets. 00 =least-significant bit.


Sign
Exponent
Fraction
Single Precision
1 [31]
8 [30–23]
23 [22–00]
DoublePrecision
1 [63]
11 [62–52]
52 [51–00]

Laid out as bits, floating point numbers look like this −

Single Precision − SEEEEEEE EFFFFFFF FFFFFFFF FFFFFFFF

Double Precision − SEEEEEEE EEEEFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFFFFFFFFFF


Updated on: 27-Jun-2020

1K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements