This course is still very much in the rough!
This course belongs to the track Numerical Algorithms in the Department of Scientific Computing in the School of Computer Science.
In this course, students will learn how numbers are represented on modern-day computers, what the limits and pitfalls of these representations are, and how they can be overcome.
On binary computers, whole numbers (integers) are stored in bit arrays of fixed size. These arrays contain the binary representation of the number.
The decimal number 37, for instance, can be converted to binary (37 = 100101 in base 2) and stored in an 8-bit variable as 00100101.
The range of numbers that can be stored in N bits is 0 to 2^N - 1, or -2^(N-1) to 2^(N-1) - 1 if one bit is used for the sign.
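To make this concrete, here is a minimal C sketch (the variable names are arbitrary) that stores 37 in an 8-bit variable and prints the ranges provided by the standard headers:

    #include <stdio.h>
    #include <stdint.h>
    #include <limits.h>

    int main(void) {
        uint8_t x = 37;   /* 37 = 100101 in binary, stored as 00100101 */
        printf("x = %d\n", x);
        printf("8-bit unsigned: 0 .. %d\n", UINT8_MAX);            /* 0 .. 255    */
        printf("8-bit signed:   %d .. %d\n", INT8_MIN, INT8_MAX);  /* -128 .. 127 */
        printf("int:            %d .. %d\n", INT_MIN, INT_MAX);
        return 0;
    }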
This trivial scheme works rather well, since bit arrays are integer-valued by nature. However, mathematics is usually not restricted to integer arithmetic, so a different representation has to be found for real-valued numbers such as 1/3, sqrt(2), pi, and the like.
The most trivial approach for real-valued numbers would be to pack them into two integers, one representing the integer part, the other the fraction. The number 37.25, for example, could be stored in two 8-bit fields as 00100101 and 01000000, i.e. 100101.01 in binary.
Notice that the fractional part is not 01000000 = 64, but 0.25: the k-th digit to the right of the "." in binary has the value 2^(-k), so 0.01 in binary is 2^(-2) = 0.25.
Using such a representation -- generally called a fixed-point representation -- results in very simple operations (addition, multiplication, division, etc.), yet its range would be limited from
2^(-f) to 2^i - 2^(-f),
where i and f are the number of bits in the integer and fractional parts respectively. This is not a very wide range.
Furthermore, for large numbers, we are carrying around a lot of lower-order bits which we probably don't need since, as we will see later, we are usually interested in relative accuracy (i.e. the first few digits of a number) as opposed to the absolute accuracy of the last digits of the fractional part. Likewise, for small numbers of the order of 2^(-f), we are only accurate to a few significant digits, even though we are carrying around a bunch of leading zeros we don't really need.
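To illustrate, here is a minimal C sketch of such a fixed-point format, assuming 8 integer and 8 fractional bits packed into a 16-bit integer (the helper names to_fixed and from_fixed are invented for this example):

    #include <stdio.h>
    #include <stdint.h>

    /* 8.8 fixed point: the stored integer equals the real value times 2^8 */
    typedef uint16_t fixed_t;
    #define FRAC_BITS 8

    static fixed_t to_fixed(double x)    { return (fixed_t)(x * (1 << FRAC_BITS) + 0.5); }
    static double  from_fixed(fixed_t f) { return (double)f / (1 << FRAC_BITS); }

    int main(void) {
        fixed_t a = to_fixed(37.25);   /* raw bits 00100101 01000000 = 0x2540   */
        fixed_t b = to_fixed(1.5);
        fixed_t sum = a + b;           /* addition is ordinary integer addition */
        printf("a     = %.4f (raw 0x%04X)\n", from_fixed(a), (unsigned)a);
        printf("a + b = %.4f\n", from_fixed(sum));
        /* smallest step: 2^-8 ~ 0.0039, largest value: 2^8 - 2^-8 ~ 255.996 */
        return 0;
    }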
Instead of using two numbers additively, as in the example above, we could represent our real numbers as the product of a fixed-precision number m (the mantissa) and a power of 2:
x = m * 2^e.
This is equivalent to shifting the binary point in m by e bits. The "." in m can therefore be set arbitrarily; for all practical purposes, we set it after the first digit.
This is the concept of modern floating-point numbers as described in IEEE 754. The range of positive numbers that can be represented is from roughly 2.2*10^-308 to 1.8*10^308 in double precision (and from roughly 1.2*10^-38 to 3.4*10^38 in single precision).
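As a quick illustration of the mantissa-times-power-of-two idea, the standard C functions frexp and ldexp split a double into exactly these two parts (note that frexp normalizes the mantissa to [0.5, 1), whereas IEEE 754 puts the point after the first digit, giving [1, 2)):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double x = 37.25;
        int e;
        double m = frexp(x, &e);             /* x = m * 2^e with 0.5 <= |m| < 1 */
        printf("%g = %.17g * 2^%d\n", x, m, e);
        printf("reassembled: %g\n", ldexp(m, e));
        return 0;
    }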
TODO: Show how sum, multiplication and division are computed.
4+3+4+2=3 (mod 5)
Show representation for single and double precision, sizes of mantissa and exponent
First bit of the mantissa is always 1 for normalized numbers (so it need not be stored); the exponent is stored with a bias for easier arithmetic.
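A possible C sketch of this layout for single precision (1 sign bit, 8 exponent bits with bias 127, 23 stored mantissa bits); the bit manipulation assumes the usual IEEE 754 binary32 format:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float x = 37.25f;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);              /* reinterpret the 32 bits   */
        uint32_t sign     = bits >> 31;              /* 1 sign bit                */
        uint32_t exponent = (bits >> 23) & 0xFFu;    /* 8 exponent bits, bias 127 */
        uint32_t mantissa = bits & 0x7FFFFFu;        /* 23 stored mantissa bits   */
        /* value = (-1)^sign * (1 + mantissa / 2^23) * 2^(exponent - 127)         */
        printf("sign = %u, exponent = %u (unbiased %d), mantissa = 0x%06X\n",
               sign, exponent, (int)exponent - 127, mantissa);
        return 0;
    }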
Explain different rounding modes.
Concept of machine epsilon. Definition as the smallest number eps for which 1 + eps > 1 in floating-point arithmetic, and as the spacing between consecutive floating-point numbers between 1.0 and 2.0.
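A common way to estimate this value directly is to keep halving until adding it to 1.0 no longer changes the result; a minimal C sketch (the volatile qualifier is a precaution against the compiler carrying extra precision):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        volatile double eps = 1.0;
        while (1.0 + eps / 2.0 > 1.0)   /* halve until 1 + eps/2 rounds back to 1 */
            eps /= 2.0;
        printf("computed eps = %g\n", eps);        /* ~2.22e-16 for double */
        printf("DBL_EPSILON  = %g\n", DBL_EPSILON);
        return 0;
    }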
Realmin and Realmax
Show how these values can be computed in Matlab, C and Pascal
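For C, these limits come from <float.h>; in Matlab the corresponding built-ins are realmin, realmax and eps. A minimal C sketch:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        printf("realmin (smallest normalized double): %g\n", DBL_MIN);      /* ~2.2e-308 */
        printf("realmax (largest finite double):      %g\n", DBL_MAX);      /* ~1.8e+308 */
        printf("eps     (machine epsilon):            %g\n", DBL_EPSILON);  /* ~2.2e-16  */
        return 0;
    }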
Note on the problem with extra precision (80-bit registers) on Intel Pentiums.
Basic explanation of the problem with an example, preferably with a figure.
Explain how it can be avoided by sorting or with Kahan summation.
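A minimal C sketch comparing a naive loop with Kahan (compensated) summation; adding 0.1 ten million times is just an illustrative choice (compile without -ffast-math, which may optimize the compensation away):

    #include <stdio.h>

    int main(void) {
        const long n = 10000000;      /* mathematically, 10^7 * 0.1 = 1000000       */
        const double term = 0.1;      /* 0.1 is not exactly representable in binary */

        /* naive summation: rounding errors accumulate */
        double naive = 0.0;
        for (long i = 0; i < n; i++)
            naive += term;

        /* Kahan summation: c carries the rounding error of each step */
        double s = 0.0, c = 0.0;
        for (long i = 0; i < n; i++) {
            double y = term - c;      /* corrected next term                 */
            double t = s + y;         /* low-order bits of y may be lost ... */
            c = (t - s) - y;          /* ... recover them here               */
            s = t;
        }

        printf("naive: %.10f\n", naive);
        printf("kahan: %.10f\n", s);
        return 0;
    }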
Give an example illustrating the problem.
There is no general trick; problems need to be reformulated to avoid subtracting numbers of similar magnitude.
Show some examples of how this can be done.
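One classic example of such a reformulation: for large x, sqrt(x+1) - sqrt(x) loses all significant digits, but multiplying by the conjugate gives the equivalent form 1/(sqrt(x+1) + sqrt(x)), which involves no cancellation. A minimal C sketch:

    #include <stdio.h>
    #include <math.h>

    /* direct form: catastrophic cancellation for large x */
    double f_naive(double x)  { return sqrt(x + 1.0) - sqrt(x); }

    /* algebraically equivalent form with no subtraction */
    double f_stable(double x) { return 1.0 / (sqrt(x + 1.0) + sqrt(x)); }

    int main(void) {
        double x = 1.0e16;
        printf("naive : %.17g\n", f_naive(x));   /* prints 0 -- all digits lost */
        printf("stable: %.17g\n", f_stable(x));  /* ~5e-09, the correct result  */
        return 0;
    }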