Minimizing Numeric Errors

Paul Ngugi - May 8 - - Dev Community

Using floating-point numbers in the loop continuation condition may cause numeric errors.

Numeric errors involving floating-point numbers are inevitable, because floating-point numbers are represented in approximation in computers by nature. This section discusses how to minimize such errors through an example.

The program presents an example summing a series that starts with 0.01 and ends with 1.0. The numbers in the series will increment by 0.01, as follows: 0.01 + 0.02 + 0.03, and so on.

Image description

The for loop (lines 10-11) repeatedly adds the control variable i to sum. This variable, which begins with 0.01, is incremented by 0.01 after each iteration. The loop terminates when i exceeds 1.0.

The for loop initial action can be any statement, but it is often used to initialize a control variable. From this example, you can see that a control variable can be a float type. In fact, it can be any data type.

The exact sum should be 50.50, but the answer is 50.499985. The result is imprecise because computers use a fixed number of bits to represent floating-point numbers, and thus they cannot represent some floating-point numbers exactly. If you change float in the program to double, as follows, you should see a slight improvement in precision, because a double variable holds 64 bits, whereas a float variable holds 32 bits.

// Initialize sum
double sum = 0;
// Add 0.01, 0.02, ..., 0.99, 1 to sum
for (double i = 0.01; i <= 1.0; i = i + 0.01)
 sum += i;
Enter fullscreen mode Exit fullscreen mode

However, you will be stunned to see that the result is actually 49.50000000000003. What went wrong? If you display i for each iteration in the loop, you will see that the last i is slightly larger than 1 (not exactly 1). This causes the last i not to be added into sum. The fundamental problem is that the floating-point numbers are represented by approximation. To fix the problem, use an integer count to ensure that all the numbers are added to sum. Here is the new loop:

double currentValue = 0.01;
for (int count = 0; count < 100; count++) {
 sum += currentValue;
 currentValue += 0.01;
}
Enter fullscreen mode Exit fullscreen mode

After this loop, sum is 50.50000000000003. This loop adds the numbers from smallest to biggest. What happens if you add numbers from biggest to smallest (i.e., 1.0, 0.99, 0.98, . . . , 0.02, 0.01 in this order) as follows:

double currentValue = 1.0;
for (int count = 0; count < 100; count++) {
 sum += currentValue;
 currentValue -= 0.01;
}
Enter fullscreen mode Exit fullscreen mode

After this loop, sum is 50.49999999999995. Adding from biggest to smallest is less accurate than adding from smallest to biggest. This phenomenon is an artifact of the finite-precision arithmetic. Adding a very small number to a very big number can have no effect if the result requires more precision than the variable can store. For example, the inaccurate result of 100000000.0 + 0.000000001 is 100000000.0. To obtain more accurate results, carefully select the order of computation. Adding smaller numbers before bigger numbers is one way to minimize errors.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .