FIXED POINT ARITHMETIC:
Fixed-point arithmetic is widely used in hardware implementations. Fixed point is a
method of describing real numbers (ones with an integer part and a fractional part)
using only integer values: the bits of the integer are divided into an integer
portion and a fractional portion.
In Q_{m.n} notation, m is the number of bits for the integer portion and n is the
number of bits for the fractional portion; m+n is known as the Word Length (WL).
For signed numbers, the total number of bits is N = m + n + 1.
A fixed-point value can be calculated as:
Fixed-point value = real number * scale
To convert from fixed point back into a real number:
Real number = fixed-point value / scale
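As a quick sanity check of the two formulas above, here is a minimal sketch in Python; the value 3.14159 and the Q4.8 format are just illustrative choices:

```python
# Q4.8 format: 8 fractional bits, so the scale factor is 2^8 = 256
SCALE = 1 << 8

real = 3.14159
fixed = round(real * SCALE)   # real -> fixed-point integer: 804
back = fixed / SCALE          # fixed-point integer -> real: 3.140625
print(fixed, back)
```

Note that the round trip loses precision: 3.14159 comes back as 3.140625, the nearest multiple of 1/256.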
To convert a real number to a fixed-point number, multiply by the scale factor
2^n of the chosen Q format:

m.n     Integer bits    Fractional bits    Scale factor
4.8     4               8                  2^8 = 256
8.8     8               8                  2^8 = 256
2.14    2               14                 2^14 = 16384
Conversion from a fractional value to an integer value:
Step 1: Normalize the decimal fractional number to the range determined by the desired Q format.
Step 2: Multiply the normalized fractional number by 2^n.
Step 3: Round the product to the nearest integer.
Step 4: Write the decimal integer value in binary using N bits.
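The steps above can be sketched in Python. This is a minimal illustration assuming a signed value stored in an N-bit two's-complement word; the function names are my own:

```python
def float_to_q(value, n, total_bits):
    """Convert a real number to an integer in a signed Q format (N = total_bits)."""
    # Steps 2-3: multiply by 2^n and round to the nearest integer
    fixed = round(value * (1 << n))
    # Step 4: keep the low N bits (two's complement encodes negative values)
    return fixed & ((1 << total_bits) - 1)

def q_to_float(fixed, n, total_bits):
    """Convert a signed Q-format integer back to a real number."""
    # Undo the two's-complement wrap, then divide by the scale factor 2^n
    if fixed >= 1 << (total_bits - 1):
        fixed -= 1 << total_bits
    return fixed / (1 << n)

# 16-bit word with 14 fractional bits: representable range is [-2, 2)
print(float_to_q(0.5, 14, 16))                        # 8192
print(q_to_float(float_to_q(-0.5, 14, 16), 14, 16))   # -0.5
```

The masking in `float_to_q` is what makes negative inputs come out as their two's-complement bit patterns (for example, -0.5 becomes 57344 in a 16-bit word).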
FLOATING POINT NUMBER:
The term floating point is derived from the fact that there is no fixed number of
digits before and after the decimal point. In general, floating-point
representations are slower and less accurate than fixed-point representations,
but they can hold a larger range of numbers. A floating-point number is represented
approximately to a fixed number of significant digits and scaled using an
exponent; the base for the scaling is normally two, ten, or sixteen. A number
that can be represented exactly has the following form:
Significand x base^{exponent}
For example: 1.2345 = 12345 x 10^{-4}
Floating-point representation is similar in concept to scientific notation. Logically, a
floating-point number consists of:
- A signed (meaning negative or non-negative) digit string of a given length in a given base (or radix). The digit string is referred to as the significand, mantissa, or coefficient. The length of the significand determines the precision to which the number can be represented. The radix point is assumed always to lie somewhere within the significand.
- A signed integer exponent, which modifies the magnitude of the number.
Nearly all hardware and programming languages use floating-point numbers in the same binary formats, which are defined in the IEEE 754 standard. The usual floating-point formats are 32 or 64 bits in total length:
Single Precision – 32 bits in total: 1 sign bit, 8 exponent bits, and 23 significand bits.
Double Precision – 64 bits in total: 1 sign bit, 11 exponent bits, and 52 significand bits.
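To make the single-precision layout concrete, the sketch below unpacks the three IEEE 754 fields of a number packed as a 32-bit float (the function name is illustrative; the exponent comes out biased by 127, as the standard specifies):

```python
import struct

def fp32_fields(x):
    # Reinterpret x as its 32-bit single-precision bit pattern
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits (biased by 127)
    significand = bits & 0x7FFFFF      # 23 stored significand bits
    return sign, exponent, significand

print(fp32_fields(1.0))    # (0, 127, 0): +1.0 x 2^(127-127)
print(fp32_fields(-2.0))   # (1, 128, 0): -1.0 x 2^(128-127)
```

The stored significand omits the leading 1 of a normalized number, which is why both examples show a significand field of 0.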