9.1   Floating Point

This section under major construction.

One distinguishing feature that separates traditional computer science from scientific computing is its use of discrete mathematics (0s and 1s) instead of continuous mathematics and calculus. Transitioning from integers to real numbers is more than a cosmetic change. Digital computers cannot represent all real numbers exactly, so we face new challenges when designing computer algorithms for real numbers. Now, in addition to analyzing the running time and memory footprint, we must be concerned with the "correctness" of the resulting solutions. This challenging problem is further exacerbated since many important scientific algorithms make additional approximations to accommodate a discrete computer. Just as we discovered that some discrete algorithms are inherently too slow (polynomial vs. exponential), we will see that some floating point algorithms are too inaccurate (stable vs. unstable). Sometimes this problem can be corrected by designing a more clever algorithm. With discrete problems, the difficulty is sometimes intrinsic (NP-completeness). With floating point problems, the difficulty may also be inherent (ill conditioning), e.g., accurate long-term weather prediction. To be an effective computational scientist, we must be able to classify our algorithms and problems accordingly.

Floating point.

Some of the greatest achievements of the 20th century would not have been possible without the floating point capabilities of digital computers. Nevertheless, this subject is not well understood by most programmers and is a regular source of confusion. In a February, 1998 keynote address entitled Extensions to Java for Numerical Computing James Gosling asserted "95% of folks out there are completely clueless about floating-point." However, the main ideas behind floating point are not difficult, and we will demystify the confusion that plagues most novices.

IEEE 754 binary floating point representation.

First we will describe how floating point numbers are represented. Java uses a subset of the IEEE 754 binary floating point standard to represent floating point numbers and define the results of arithmetic operations. Virtually all modern computers conform to this standard. A float is represented using 32 bits, and each possible combination of bits represents one real number. This means that at most 232 possible real numbers can be exactly represented, even though there are infinitely many real numbers (even between 0 and 1). The IEEE standard uses an internal representation similar to scientific notation, but in binary instead of base 10. This covers a range from ±1.40129846432481707e-45 to ±3.40282346638528860e+38. with 6 or 7 significant decimal digits, including including plus infinity, minus infinity, and NaN (not a number). The number contains a sign bit s (interpreted as plus or minus), 8 bits for the exponent e, and 23 bits for the mantissa M. The decimal number is represented according to the following formula.
(-1)s × m × 2(e - 127)

As an example the decimal number 0.085 is stored as 00111101101011100001010001111011.

bits:    31   30-23           22-0
binary:   0 01111011 01011100001010001111011
decimal:  0    123           3019899

This exactly represents the number 2e-127 (1 + m / 223) = 2-4(1 + 3019899/8388608) = 11408507/134217728 = 0.085000000894069671630859375.

A double is similar to a float except that its internal representation uses 64 bits, an 11 bit exponent with a bias of 1023, and a 52 bit mantissa. This covers a range from ±4.94065645841246544e-324 to ±1.79769313486231570e+308 with 14 or 15 significant digits of accuracy.

Precision vs. accuracy.

Precision = tightness of specification. Accuracy = correctness. Do not confuse precision with accuracy. 3.133333333 is an estimate of the mathematical constant π which is specified with 10 decimal digits of precision, but it only has two decimal digits of accuracy. As John von Neumann once said "There's no sense in being precise when you don't even know what you're talking about." Java typically prints out floating point numbers with 16 or 17 decimal digits of precision, but do not blindly believe that this means there that many digits of accuracy! Calculators typically display 10 digits, but compute with 13 digits of precision. Kahan: the mirror for the Hubble space telescope was ground with great precision, but to the wrong specification. Hence, it was initially a great failure since it couldn't produce high resolution images as expected. However, it's precision enabled an astronaut to install a corrective lens to counter-balance the error. Currency calculations are often defined in terms of a give precision, e.g., Euro exchange rates must be quoted to 6 digits.

Roundoff error.

Programming with floating point numbers can be a bewildering and perilous process for the uninitiated. Arithmetic with integers is exact, unless the answer is outside the range of integers that can be represented (overflow). In contrast, floating point arithmetic is not exact since some real numbers require an infinite number of digits to be represented, e.g., the mathematical constants e and π and 1/3. However, most novice Java programmers are surprised to learn that 1/10 is not exactly representable either in the standard binary floating point. These roundoff errors can propagate through the calculation in non-intuitive ways.

Every time you perform an arithmetic operation, you introduce an additional error of at least ε. Kernighan and Plauger: "Floating point numbers are like piles of sand; every time you move one you lose a little sand and pick up a little dirt." If the errors occur at random, we might expect a cumulative error of sqrt(N) εm. However, if we are not vigilant in designing our numerical algorithms, these errors can propagate in very unfavorable and non-intuitive ways, leading to cumulative errors of N εm or worse.

Financial computing.

We illustrate some examples of roundoff error that can ruin a financial calculation. Financial calculations involving dollars and cents involve base 10 arithmetic. The following examples demonstrate some of the perils of using a binary floating point system like IEEE 754.

Other source of error.

In addition to roundoff error inherent when using floating point arithmetic, there are some other types of approximation errors that commonly arise in scientific applications.

Catastrophic cancellation.

Devastating loss of precision when small numbers are computed from large numbers by addition or subtraction. For example if x and y agree in all but the last few digits, then if we compute z = x - y, then z may only have a few digits of accuracy. If we subsequently use z in a calculation, then the result may only have a few digits of accuracy.

Numerical analysis.

Lloyd Trefethen (1992) "Numerical analysis is the study of algorithms for the problems of continuous mathematics."

Real-world numerical catastrophes.

The examples we've discussed above are rather simplistic. However, these issues arise in real applications. When done incorrectly, disaster can quickly strike.


Q + A

Q. Any good references on floating point?

A. Here are two articles on floating point precision: What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg and How Java's Floating-Point Hurts Everyone Everywhere co-authored by Turing award winner William Kahn. Here's Wikipedia's entry on Numerical analysis.

Q. How can I convert from IEEE bit representation to double?

A. Here is a decimal to IEEE converter. To get the IEEE 754 bit representation of a double variable x in Java, use Double.doubleToLongBits(x). According to The Code Project, to get the smallest double precision number greater than x (assuming x is positive and finite) is Double.longBitsToDouble(Double.doubleToLongBits(x) + 1).

Q. Is there any direct way to check for overflow on integer types?

A. No. The integer types do not indicate overflow in any way. Integer divide and integer remainder throw exceptions when the denominator is zero.

Q. What happens if I enter a number that is too large, e.g., 1E400?

A. Java returns the error message "floating point number too large."

Q. What about floating point types?

A. Operations that overflow evaluate to plus or minus infinity. Operations that underflows result in plus or minus zero. Operations that have no mathematically definite evaluate to NaN (not a number), e.g., 0.0/0.0, Math.sqrt(-3.0), Math.acos(3.0), Math.log(-3.0), and Math.pow(8, 1.0/3.0).

Q. How do I test whether my variable has the value NaN?

A. Use the method Double.isNaN(). Note that NaN is unordered so the comparison operations <, >, and = involving one or two NaNs always evaluates to false. Any != comparison involving NaN evaluates to true, even (x != x), when x is NaN.

Q. What's the difference between -0.0 and 0.0?

A. Both are representations of the number zero. It is true that if x = 0.0 and y = -0.0 that (x == y). However, 1/x yields Infinity, whereas 1/y yields -Infinity. Program NegativeZero.java illustrates this. It gives a surprising counterexample to the misconception that if (x == y) then (f(x) == f(y)). log 0 = -infinity, log (-1) = NaN. log(ε) vs. log(-ε).

Q. I've heard that programmers should never compare two real numbers for exact equality, but you should always use a test like if |a-b| < ε or |a-b| < ε min(|a|, |b|). Is this right?

A. It depends on the situation. If you know that the floating-point numbers are exactly representable (e.g., multiple of 1/4), it's fine to compare for exact equality. If the floating-point numbers are not exactly representable, then the relative error method is usually preferred (but can fail if the numbers are very close to zero). The absolute value method may have unintended consequences. It is not clear what value of ε to use. If a = +ε/4 and b = -ε/4, should we really consider them equal. Transitivity is not true: if a and b are "equal" and b and c are "equal", it may not be the case that a and c are "equal."

Q. What rule does Java use to print out a double?

A. By setting all the exponent bits to 1. Here's the java.lang.Double API. It always prints at least one digit after the decimal. After that, it uses as many digits as necessary (but no more) to distinguish the number from the nearest representable double.

Q. How are zero, infinity and NaN represented using IEEE 754?

A. By setting all the exponent bits to 1. Positive infinity = 0x7ff0000000000000 (all exponent bits 1, sign bit 0 and all mantissa bits 0), negative infinity = 0xfff0000000000000 (all exponent bits 1, sign bit 1 and all mantissa bits 0), NaN = 0x7ff8000000000000 (all exponent bits 1, at least one mantissa bit set). Positive zero = all bits 0. Negative zero = all bits 0, except sign bit which is 1.

Q. What is the exact value represented when I store 0.1 using double precision IEEE point.

A. It's 0.1000000000000000055511151231257827021181583404541015625. You can use System.out.println(new BigDecimal(0.1)); to see for yourself.

Q. What is StrictMath?

A. Java's Math library guarantees its results to be with 1 or to 2 ulps of the true answer. Java's StrictMath guarantees that all results are accurate to within 1/2 ulp of the true answer. A classic tradeoff of speed vs. accuracy.

Q. What is strictfp modifier?

A. It is also possible to use the strictfp modifier when declaring a class or method. This ensures that the floating point results will be bit-by-bit accurate across different JVMs. The IEEE standard allows processors to perform intermediate computations using more precision if the result would overflow. The strictfp requires that every intermediate result be truncated to 64-bit double format. This can be a major performance hit since Intel Pentium processor registers operate using IEEE 754's 80-bit double-extended format. Not many people have a use for this mode.

Q. What does the compiler flag javac -ffast-math do?

A. It relaxes some of the rounding requirements of IEEE. It makes some floating point computations faster.

Q. Are integers always represented exactly using IEEE floating point?

A. Yes, barring overflow of the 52 bit mantissa. The smallest positive integer that is not represented exactly using type double is 253 + 1 = 9,007,199,254,740,993.

Q. Does (a + b) always equal (b + a) when a and b and floating point numbers?

A. Yes.

Q. Does x / y always equal the same value, independent of my platform?

A. Yes. IEEE requires that operations (+ * - /) are performed exactly and then rounded to the nearest floating point number (using Banker's round if there is a tie: round to nearest even number). This improves portability at the expense of efficiency.

Q. Does (x - y) always equal (x + (-y))?

A. Yes. Guaranteed by IEEE 754.

Q. Is -x always the same as 0 - x?

A. Almost, except if x = 0. Then -x = -0, but 0 - x = 0.

Q. Does (x + x == 2 * x)? (1 * x == x)? (0.5 * x == x / 2)?

A. Yes, guaranteed by IEEE 754.

Q. Does x / 1000.0 always equal 0.001 * x?

A. No. For example, if x = 1138, they are different.

Q. If (x != y), is z = 1 / (x - y) always guaranteed not to produce division by zero.

A. Yes, guaranteed by IEEE 754 (using denormalized numbers).

Q. Is (x >= y) always the same as !(x < y)?

A. No if either x or y is NaN, or both are NaN.

Q. Why not use a decimal floating point instead of binary floating point?

A. Decimal floating point offers several advantages over binary floating point, especially for financial computations. However, it usually requires about 20% more storage (assuming it is stored using binary hardware) and the resulting code is somewhat slower.

Q. Why not use a fixed point representation instead of a floating point? floating point?

A. Fixed point numbers have a fixed number of digits after the decimal place. Can be represented using integer arithmetic. Floating point has a sliding window of precision, which provides a large dynamic range and high precision. Fixed point numbers are used on some embedded hardware devices that do not have an FPU (to save money), e.g., audio decoding or 3D graphics. Appropriate when data is constrained in a certain range.

Q. Any good resources for numerics in Java?

A. Java numerics provides a focal point for information on numerical computing in Java. JSci: a free science API for numerical computing in Java.

Q. Why are Java's Math.sin and Math.cos functions slower than their C counterparts?

A. In the range -pi/4 to pi/4, Java uses the hardware sin and cos functions, so they take about the same time as in C. Arguments outside this range must be converted to this range by taking them modulo pi. As James Gosling blogs on x87 platform, the hardware sin and cos use an approximation which makes the computation fast, but it is not of the accuracy required by IEEE. C implementation typically use the hardware sin and cos for all arguments, trading off IEEE conformance for speed.


  1. What is the result of the following code fragment?
    double a = 12345.0;
    double b = 1e-16;
    System.out.println(a + b == a);
  2. List all representable numbers for 6-bit floating point numbers with 1 sign bit, 3 exponent bits, and 2 mantissa bits. What is the result of the following code fragment?
  3. How many values does the following code fragment print?
    for (double d = 0.1; d <= 0.5; d += 0.1)
    for (double d = 1.1; d <= 1.5; d += 0.1)

    Answer: nine (five in the first loop and four in the second loop). The output is:

  4. What numbers are printed out when executing the following code fragment with N = 25?
    for (int i = 0; i < N; i++) {
       int counter = 0;
       for (double x = 0.0; x < i; x += 0.1)
       if (counter != 10*i) System.out.print(i + " ");

    Answer: 1 5 6 7 8 9 10 11 12 13 14 15 16 17 18. Unlikely you would predict this without typing it in. Avoid using floating point numbers to check loop continuation conditions.

  5. What is the result of the following code fragment?
    for (double t = 0.0; t < 100.0; t += 0.01)
    Answer: pretty much what you expect, but there is a lot of noise in the least significant two or three digits.
  6. What does the following code fragment print?
    Prints out 0.9200000000000002 twice!
  7. Find a value of x for which (0.1 * x) is different from (x / 10).
  8. Find a real number a such that (a * (3.0 / a)) doesn't equal 3.0. Answer: If a = 0, the first expression evaluates to NaN.
  9. Find a real number a such that (Math.sqrt(a) * Math.sqrt(a)) doesn't equal a. Answer: a = 2.0.
  10. Find floating point values such that (x/x != 1), (x - x != 0), (0 != 0 * x). Answer: consider values of x that are 0, +- infinity, or NaN.
  11. Find floating point values x and y such that x >= y is true, but !(x < y) is false.

    Hint: consider cases where either x or y (or both) have the value NaN.

  12. What is the result of the following code fragment from FloatingPoint.java?
    float f = (float) (3.0 / 7.0);
    if (f == 3.0 / 7.0) System.out.println("yes");
    else                System.out.println("no");

  13. Examples from How Reliable are Results of Computers by S. M. Rump. See program Rump.java.
    • Compute a*a - 2*b*b where a = 665857 and b = 470832 are of type float. Then compute a*a - 2.0*b*b. Repeat with type double. Java answers with float: 0.0 and 11776.0. Java answers with double: 1.0. Exact answer: 1.0.
    • Compute 9x4 - y4 + 2y2 where x = 10864 and y = 18817 are of type double. Java answer: 2.0. Exact answer: 1.0.
    • Compute p(x) = 8118x4 - 11482x3 + x2 + 5741x - 2030 where x = 0.707107 is of type float and double. Java answer for float: 1.2207031E-4. Java answer for double: -1.9554136088117957E-11. Exact answer: -1.91527325270... * 10^-7.
  14. Do you think using the type double would be a good idea for storing stock prices that are always specified in increments of 1/64. Explain why or why not. Answer: yes, it would be perfectly fine since such values are represented exactly using binary floating point. If the stock prices were represented in decimal, e.g., 45.10, that could result in roundoff error.
  15. Fix Exponential.java so that it works properly for negative inputs using the method described in the text.
  16. Sum the following real numbers from left to right. Use floating point arithmetic, and assume your computer only stores three decimal digits of precision.
    0.007 0.103 0.205 0.008 3.12 0.006 0.876 0.005 0.015

    Repeat using the priority queue algorithm. Repeat using Kahan's algorithm.

  17. Write a program to read in a list of bank account values and print the average value, rounded exactly to the nearest penny using banker's rounding. The input values will be separeated by whitespace and using two decimal digits. Hint: avoid Double.parseDouble.
      100.00   489.12   1765.12   3636.10
     3222.05  3299.20   5111.27   3542.25
    86369.18   532.99
  18. What does the following code fragment print?
    long  x = 16777217;       // 2^24 + 1
    float y = 16777217;
    DecimalFormat df = new DecimalFormat("0.00000000000");

    Answer: 16777217 and 16777216.00000000000. 2^24 + 1 is the smallest positive integer that is not exactly representable using type float.

  19. Can you find values a, b, and c such that Math.sqrt(b*b - a*c) is invalid (NaN) but (b*b < a*c) is false? Kahan.
  20. What is wrong with each of the following two loops?
    int count1 = 0;
    for (float x = 0.0f; x < 20000000.0f; x = x + 1.0f)
    int count2 = 0;
       for (double x = 0.0; x < 20000000.0; x = x + 1.0)

    Answer: The first code fragment goes into an infinite loop when x = 16777216.0f since 16777217.0f cannot be represented exactly using type float and 16777216.0f + 1.0f = 16777216.0f. The second loop is guaranteed to work correctly, but using a variable of type int for the loop variable might be more efficient.

  21. One way to define e is the limit as n approaches infinity of (1 + 1/n)n. Write a computer program to estimate e using this formula. For which value of n do you get the best approximation of e? Hint: if n gets too large, then 1 + 1/n is equal to 1 using floating point arithmetic.
  22. What's wrong with the following implementation of Math.max().
    public static double max(double x, double y) {
        if (x >= y) return x;
        return y;
    Answer: double not handle NaN properly. According to this definition max(0, NaN) is NaN but max(NaN, 0) is 0. It should be NaN.
  23. Not all subtraction is catastrophic. Demonstrate that the expression x2 - y2 exhibits catastrophic cancellation when x is close to y. However, the improved expression (x + y)(x - y) still subtracts nearly equal quantities, but it is a benign cancellation.
  24. (Goldberg) (1 + i/n)^n arises in financial calculations involving interest. Rewrite as e^(n ln (1 + i/n)). Now trick is to compute ln(1+x). Math.log(1+x) not accurate when x << 1.
    if (1 + x == 1) return x
    else return (x * Math.log(1 + x)) / (1 + x) - 1.
    Alternative in Java 1.5: Math.log1p(x)
  25. To compute e^x - 1, use Math.expm1(x). For values of x near 0, Math.expm1(x) + 1 is more accurate than Math.exp(x).
  26. Catastrophic cancellation: f(x) = ex - sin x - cos x near x = 0. Use x2 + x3/3 as an approximation near x = 0.
  27. Catastrophic cancellation: f(x) = ln(x) - 1. For x near e use ln(x/e).
  28. Catastrophic cancellation: f(x) = ln(x) - log(1/x). For x near 1 use 2 ln(x).
  29. Catastrophic cancellation: x-2(sinx - ex + 1).
  30. Find a value of x for which Math.abs(x) doesn't equal Math.sqrt(x*x).

  31. Numerical stability. The golden mean φ = sqrt(5)/2 - 1/2 = 0.61803398874989484820... It satisfies the equation φN = φN-2 - φN-1. We could use this recurrence to compute power of φ by starting with phi0 = 1, phi1 = φ, and iterating the recurrence to calculate successive powers of φ. However, floating-point roundoff error swamps the computation by about N = 40. Write a program Unstable.java that reads in a command line parameter N and prints out φN computed using the recurrence above and also using Math.pow(). Run your program with N = 40 and N = 100 to see the consequences of roundoff error.

Creative Exercises

  1. Roundoff errors. Sums. Relative error = |x - x'| / |x| = independent of units. One idea is to create priority queue of numbers and repeatedly add the two smallest values and insert back into the priority queue. (See exercise XYZ.) Simpler and faster alternative: Kahan's summation formula (Goldberg, Theorem 8). Example of useful algorithm that exploits fact that (x + y) + z doesn't equal x + (y + z). Optimizing compiler better not rearrange terms.
    double c = 0.0, sum = 0.0, y;
    for (int i = 0; i < N; i++)
      y = term[i] - c;
      c = ((sum + y) - sum) - y;
      sum = t;

    Recommended method: sorting and adding.

  2. Cauchy-Schwartz inequality. The Cauchy-Schwartz inequality guarantees that for any real numbers xi and yi that
    (x1x1 + ... xnxn) × (y1y1 + ... ynyn)  ≥ (x1y1 + ... xnyn)2

    In particular, when n = 2, y1 = y2 = 1, we have

    2(x1x1 + x2x2) ≥ (x1 + x2)2

    Write a program CauchySchwartz.java that reads in two integers x1 and x2, and verifies that the identity is not necessarily true in floating point. Try x1 = 0.5000000000000002 and x2 = 0.5000000000000001

  3. Chebyshev approximation for distance computations. Use the following approximation for computing the distance between p and q without doing an expensive square root operation. sqrt(dx2 + dy2) = max(|dx|, |dy|) + 0.35 * min(|dx|, |dy|). How accurate.
  4. Pythagoras. Write a program Program Pythagoras.java that reads in two real-valued command line parameters a and b and prints out sqrt(a*a + b*b). Try to stave off overflow. For example, if |a| >= |b| then compute |a| sqrt(1 + (b/a)*(b/a)). If |a| < |b| do analogous. Ex: x = y = DBL_MAX / 10.0;

    Reliable implementation since Java 1.5: Math.hypot(a, b).

  5. Overflow. What is the result of the following code fragment?
    double a = 4.0, b = 0.5, c = 8e+307;
    System.out.println((a * b) * c);
    System.out.println(a * (b * c));
  6. Optimizing compiler. An optimizing compiler.... Must be wary about blindly applying algebraic laws to computer programs since this could lead to disastrous results. Show that with floating point numbers, addition is not associative. That is find values a, b, and c such that ((a + b) + c) != (a + (b + c)). Show also that the distributive property does not necessarily apply finding a, b, and c such that (a * (b + c)) != (a * b + a * c). Problem more pronounced with multiplication ((a * b) * c) != (a * (b * c)). Give simple example.
  7. Integer arithmetic. Integer addition and multiplication are provably associative in Java, e.g., (a+b)+c always equals a+(b+c).
  8. Harmonic sum. Redo the harmonic sum computation, but sum from right-to-left instead of left-to-right. It illustrates that addition is not associative (a + b) + c vs. a + (b + c) since we get different answers depending on whether we sum from left-to-right or right-to-left.
  9. Pi. Compare the following two methods for approximating the mathematical constant pi. Repeat while p_k > p_k-1. The sequence p_k converges to pi.
    s_6  = 1                               s_6  = 1
    s_2k = sqrt(2 - sqrt((2-s_k)(2+s_k))   s_2k = s_k / sqrt(2 + sqrt((2-s_k)(2+s_k))
    p_k  = k * s_k / 2                     p_k  = k * s_k / 2
    The second formula has much better behavior since it avoids the catastrophic cancellation 2 - sqrt((2-s_k)(2+s_k)).
  10. Jean-Michael Muller example. Consider the sequence x0 = 4, x1 = 4.25, xn = 108 - (815 - 1500/xn-2) / xn-1. Program Muller.java attempts to estimate what xn converges to as n gets large. It computes the value 100.0, but the correct value is 5.
  11. Another Kahan example. Myth: if you keep adding more precision and answer converges then it's correct. Counterexample: E(0) = 1, E(z) = (exp(z) - 1) / z. Q(x) = |x - sqrt(x2 + 1)| - 1/(x + sqrt(x2 + 1)) H(x) = E(Q(x)2). Compute H(x) for x = 15.0, 16.0, 17.0, 9999.0. Repeat with more precision, say using BigDecimal.
  12. Dividing rational numbers (a + ib) /(c + id). See Smith's formula (11).
  13. With integer types we must be cognizant of overflow. The same principles of overflow apply to floating point numbers. For example, to compute sqrt(x*x + y*y), use fabs(y) * sqrt(1+(x/y)*(x/y)) if x < y and analog if x > y. x = y = DBL_MAX / 10.0;
  14. Gamma function. Write a program Gamma.java that takes a command line parameter x and prints Gamma(x) and log Gamma(x) to 9 significant digits, where Gamma(x) is the gamma function:

    Gamma(x) = integral( tx-1 e-t, t = 0..infinity )

    The Gamma function is a continuous version of the factorial function: if n is a positive integer, then n! = Gamma(n+1). Use Lanczos' formula:

    Gamma(x) ≈ (x + 4.5)x - 1/2 * e-(x + 4.5) * sqrt(2 π) *
        [ 1.0 + 76.18009173    / (x + 0)   - 86.50532033    / (x + 1)
              + 24.01409822    / (x + 2)   -  1.231739516   / (x + 3)
              +  0.00120858003 / (x + 4)   -  0.00000536382 / (x + 5)

    This is accurate to 9 significant digits for x > 1.

  15. Siegfried M. Rump example. Write a program Remarkable.java to compute
    y = 333.75 b6 + a2(11 a2 b2 - b6 - 121 b4 - 2) + 5.5 b8 + a/(2b)

    for a = 77617 and b = 33096. Try with different floating point precision. Can rewrite

    x = 5.5 b8
    z = 333.75 b6 + a2(11 a2 b2 - b6 - 121 b4 - 2)
    y = z + x + a/(2b)

    It turns out that x and z have 35 digits in common.

    z = -7919111340668961361101134701524942850
    x = +7919111340668961361101134701524942848

    The answer we get in Java (on Sun Sparc) using single precision is 6.338253 ×1029 and using double precision is a/(2b) = 1.1726039400531787. The true answer is -2 + a/(2b) = -54767/66192 = -0.82739606...., but the z + x = -2 term is catostrophically cancelled unless you are using at least 122 digits of precision! We match this answer using the BigDecimal ADT in Java's Math library.

    The paper A Remarkable Example of Catastrophic Cancellation Unraveled describes this well-known formula, which demonstrates why obtaining identical results in different precisions does not mathematically imply accuracy. Moreover, even on IEEE 754 conforming platforms (Intel, Sun Sparc), you can get different answers depending on whether the intermediate rounding mode is 24, 53, or 64 bits. This remarkable example was constructed by Siegfried M. Rump at IBM for S/370 architectures.

  16. Area of a triangle. Write a program TriangleArea.java that takes three command line inputs a, b, and c, representing the side lengths of a triangle, and prints out the area of the triangle using Heron's formula: area = sqrt(s(s-a)(s-b)(s-c)), where s = (a + b + c) / 2. Inaccurate when a is close to b. Improve with 1960s era formula: sort a >= b >= c. if (c + b < a) then not a legal triangle. 1/4 * sqrt((a+(b+c)) * (c-(a-b)) * (c+(a-b)) *(a+(b-c))). Ex: a = b = 12345679.0, c = 1.01233995.
  17. Quadratic formula. The quadratic equation is a grade school formula for finding the real roots of ax2 + bx + c.
    discriminant = Math.sqrt(b*b - 4*a*c);
    root1 = -b + discriminant / (2*a);
    root2 = -b - discriminant / (2*a);

    A novice programmer who needed to calculate the roots of a quadratic equation might naively apply this formula. Of course, we should be careful to handle the a = 0 case specially to avoid division by zero. Also we should check the discriminant b2 - 4ac. If it's negative, then there are no real roots. However, the major problem with this formula is that it doesn't work if a or c are very small because one of the two roots will be computed by subtracting b from a very nearly equal quantity. The correct way to compute the roots is

    if (b > 0) q = -0.5 * (b + discriminant);
    else       q = -0.5 * (b - discriminant);
    root1 = q / a;
    root2 = c / q;

    For example the roots of x2 - 3000000x + 2 are r1 and r2. The grade school formula yields only 3 digits of accuracy for the smaller root (6.665941327810287E-7) when using double precision arithmetic, whereas the good formula yields 12 digits of accuracy (6.666666666668148E-7). When b = -30000000, the situation deteriorates. The grade school method now only has two digits of accuracy (6.705522537231445E-8) vs. 12 digits of accuracy (6.666666666666681E-8). When b = -300000000, the grade school method has zero digits of accuracy (0.0) as opposed to 16 (6.666666666666667E-9).

    Consider summarizing this in a table.

  18. Beneficial cancellation. Massive cancellations in subtraction do not always lead to catastrophic cancellation. Cancellation error does not always lead to inaccuracy. Straightforward implementation of f(x) = (x-1) / (e^(x-1) - 1) not accurate if x is close to 1. Surprisingly, the solution below achieves full working accuracy regardless of the size of x. Final computed answer more accurate than intermediate results through extreme cleverness in helpful cancellation! Error analysis can be very subtle and non-obvious.
    double y = x - 1.0;
    double z = Math.exp(y);
    if (z == 1.0) return z;
    return Math.log(z) / (z - 1.0);
    Program BeneficialCancellation.java implements the straightforward and numerically accurate approaches.

    See How Java's Floating-Point Hurts Everyone Everywhere.

  19. 1 - cos(x). Compare computing 1 - cos(x) naively and using the formula cos(x) = 2 sin(x/2)^2. The Right Way to Calculate Stuff.
  20. Square root cancellation. Devise an accurate expression for the difference between the square root of (b2 + 100) and b. Solution: If b is negative or relatively small, just do the obvious thing. If b is much larger than 100, you can get catastrophic cancellation. The difference equals b (sqrt(1 + 100/b2) - 1). If x is very small then, it's safe to approximate sqrt(1 + x) by 1 + x/2. Now, the difference is approximately b (100 / b2) / 2 = 50 / b.