Writing real numbers out in full, as we do in daily life, is not convenient for representing very small numbers, such as +0.00000012347650. The same number can be more conveniently represented in scientific notation as +1.23476 × 10^{−07}. However, this actually stands for +0.000000123476, so there is an error of 0.0000000000005 (5 × 10^{−13}), which is a very small percentage error (about 0.0004%).
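The rounding error in this example can be checked with exact decimal arithmetic. The following is a minimal sketch using Python's standard `decimal` module; the variable names are illustrative only.

```python
from decimal import Decimal

# The full number and its six-significant-digit scientific-notation form.
original = Decimal("0.00000012347650")
rounded = Decimal("1.23476E-7")

# Absolute error introduced by dropping the trailing digits.
abs_error = original - rounded

# Relative error expressed as a percentage of the original value.
rel_error_percent = abs_error / original * 100

print(abs_error)
print(rel_error_percent)
```

The absolute error comes out as 5 × 10^{−13}, and the relative error is roughly 0.0004%.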

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

- A signed (meaning positive or negative) digit string of a given length in a given **base** (or **radix**). This digit string is referred to as the significand, mantissa, or coefficient.
- A signed integer **exponent** which modifies the magnitude of the number.
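To make the two parts concrete, here is a small sketch that splits a number into a signed digit string and an integer exponent, first in base 10 and then in base 2. It relies on `Decimal.as_tuple()` and `math.frexp()` from the Python standard library; the helper name `decimal_parts` is hypothetical.

```python
import math
from decimal import Decimal

def decimal_parts(text):
    """Split a decimal literal into (sign, digit string, exponent) in base 10.

    The value equals sign * int(digit_string) * 10**exponent.
    """
    sign, digits, exponent = Decimal(text).as_tuple()
    digit_string = "".join(str(d) for d in digits)
    return ("-" if sign else "+"), digit_string, exponent

# Base 10: +1.23476E-7 is the digit string 123476 scaled by 10**-12.
print(decimal_parts("+1.23476E-7"))   # ('+', '123476', -12)

# Base 2: frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1.
print(math.frexp(0.15625))            # (0.625, -2)
```

Note that `math.frexp` normalises the significand to the range [0.5, 1), which is only one of several possible conventions for where the radix point sits.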

It is important to note that floating-point numbers suffer from *loss of precision* when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there is an *infinite* number of real numbers (even within a small range, say 0.0 to 0.1), whereas an n-bit binary pattern can represent at most 2^{n} distinct values. Hence, not all real numbers can be represented; the nearest representable approximation is used instead, resulting in a loss of accuracy.

Floating-point arithmetic is also generally less efficient than integer arithmetic, so it is better to use integers if an application does not require floating-point numbers.

Modern computers adopt the IEEE 754 standard for representing floating-point numbers. The two most commonly used representation schemes are 32-bit single-precision and 64-bit double-precision.
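The loss of precision described above can be observed directly. The sketch below assumes a standard Python build, where `float` is a 64-bit double; the helper `to_float32` is a hypothetical name that uses the `struct` module to round a value to 32-bit single precision and back.

```python
import struct
from decimal import Decimal

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it again."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Decimal(float) shows the exact value that is actually stored.
stored_double = Decimal(0.1)              # nearest 64-bit value to 0.1
stored_single = Decimal(to_float32(0.1))  # nearest 32-bit value to 0.1

print(stored_double)
print(stored_single)

# Neither stored value is exactly 0.1, so small errors show up in arithmetic.
print(0.1 + 0.2 == 0.3)   # False

# Values that are sums of a few powers of two survive unchanged.
print(to_float32(0.5) == 0.5)   # True

# Number of distinct bit patterns available in each format.
print(2 ** 32, 2 ** 64)
```

The single-precision approximation of 0.1 is further from 0.1 than the double-precision one, because fewer bits are available for the digit string.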

IEEE floating point numbers have three basic components: the sign, the exponent, and the mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored.
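One way to see these components, including the leading digit that is not stored, is Python's built-in `float.hex()`, which prints a double as a sign, a hexadecimal mantissa, and a power-of-two exponent. The parsing helper `hex_parts` below is a hypothetical name and is only a sketch: it assumes a finite input and does not handle infinities or NaN.

```python
def hex_parts(x):
    """Return (sign, leading_digit, fraction_hex, exponent) parsed from float.hex().

    For a normal double the text looks like '-0x1.4000000000000p-3':
    the '1' before the point is the implicit leading digit, the 13 hex
    digits after it are the 52 stored fraction bits, and the number
    after 'p' is the power of two.
    """
    text = x.hex()
    sign = "-" if text.startswith("-") else "+"
    mantissa, exponent = text.lstrip("-").split("p")
    leading, fraction_hex = mantissa[2:].split(".")   # drop the '0x' prefix
    return sign, leading, fraction_hex, int(exponent)

# 0.15625 is 1.25 * 2**-3, and 1.25 is 1.4 in hexadecimal.
print(hex_parts(0.15625))    # ('+', '1', '4000000000000', -3)
print(hex_parts(-0.15625))   # ('-', '1', '4000000000000', -3)
```

The exponent shown here is the mathematical power of two, not the raw bit pattern held in the exponent field.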

The following table shows the layout for single (32-bit) and double (64-bit) precision floating-point values. The number of bits for each field is shown, followed by the bit ranges in square brackets. Bit 00 is the least-significant bit.

| | Sign | Exponent | Fraction |
|---|---|---|---|
| Single Precision | 1 [31] | 8 [30–23] | 23 [22–00] |
| Double Precision | 1 [63] | 11 [62–52] | 52 [51–00] |

Laid out as bits, floating point numbers look like this −

Single Precision − SEEEEEEE EFFFFFFF FFFFFFFF FFFFFFFF

Double Precision − SEEEEEEE EEEEFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
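The layouts above can be reproduced for real values by reinterpreting a float's bytes as an unsigned integer and slicing the resulting bit string. This is a minimal sketch using the standard `struct` module; the function names `float32_layout` and `float64_layout` are hypothetical, and the exponent bits are returned exactly as stored, without further interpretation.

```python
import struct

def float32_layout(x):
    """Return the (sign, exponent, fraction) bit strings of x as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    text = format(bits, "032b")          # bit 31 first, bit 0 last
    return text[0], text[1:9], text[9:]  # 1 + 8 + 23 bits

def float64_layout(x):
    """Return the (sign, exponent, fraction) bit strings of x as a 64-bit float."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    text = format(bits, "064b")            # bit 63 first, bit 0 last
    return text[0], text[1:12], text[12:]  # 1 + 11 + 52 bits

for value in (1.0, -2.0, 0.15625):
    print(value, float32_layout(value))
    print(value, float64_layout(value))
```

For 1.0 in single precision, for instance, the sign bit is 0, the exponent field is 01111111, and the fraction field is all zeros.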
