Single vs double precision
In this article, let’s explore the IEEE-754 floating-point formats and how a real number is stored in them. IEEE-754 is a standard for representing and manipulating floating-point quantities that is followed by virtually all modern computer systems and microcontrollers.
For example, consider the above number. This is a very big number. How do you store it in memory? Storing it exactly, by converting the whole number into its binary equivalent, would consume a lot of memory. That’s why the standard says: don’t store the number as it is.
Instead, approximate the number and store only the required information: the Sign, the Exponent, and the Mantissa, which is also called the significand, as shown in Figure 2.
The number is approximated in this format and then saved in memory.
Now the next question is, how many bits are required to store all these things?
There are two formats. One is the single-precision format, which uses 32 bits to store the Significand, the Exponent, and the Sign.
In this format, 23 bits are given to the Significand, 8 bits to the Exponent, and 1 bit to the Sign. This is what we call single-precision representation.
There is one more representation called double precision, which uses 64 bits. Double precision gives a closer approximation; that is, a value stored in double precision is more accurate than the same value stored in single precision.
Here, as you can see in Figure 4, 52 bits are used to store the Significand, 11 bits to store the Exponent, and 1 bit to store the Sign. But it consumes double the memory of single-precision storage.
Consider the above number. It has an integer part, a decimal point, and a fractional part.
Let’s see how we use single-precision and double-precision storage in ‘C’ programming. For that, you have to use some special data types. For example, if you want to store this number in memory, you cannot use integer data types such as int, char, or long. If you do, only the integer part will be stored, and you will lose the fractional part.
Instead, use the data types meant for representing these decimal numbers: float and double. On most platforms, float is the 32-bit floating-point representation, which is single precision, and double is the 64-bit floating-point representation, which is double precision.
Format specifier for float and double data types
Now, let’s explore the format specifiers for the float and double data types, which we use when inputting or outputting these decimal numbers through our ‘C’ program.
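- Use the %lf format specifier to read a double type variable with scanf. With printf, both %lf and %f print a double.
- Use the %f format specifier to read a float type variable with scanf. With printf, %f also prints a float, because a float argument is promoted to double.
- Use the %e or %le format specifier to read or write real numbers in scientific notation. With scanf, %e reads a float and %le reads a double; with printf, both print the value in scientific notation.
- All constants with a decimal point are treated as double by default by the compiler. Add the f suffix (for example, 1.5f) to get a float constant.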