c#, math, statistics, average, numerical-methods

Huge numerical errors when trying to normalize data


I often process data with programs I write. To keep it simple, let us say the data is a single series of numbers of the same magnitude. When the numbers are unreasonably large, it can be useful to normalize the data. One common transformation is subtracting the average from all values; after this transformation the transformed data has an average of zero.

Another common transformation, which can be applied once the average is zero, is dividing the data by its standard deviation. After applying this transformation the new data has unit variance.

When working with data normalized this way, I expect the numerical errors to be smaller. However, I seem to fail at these transformations, because numerical errors appear even when I just try to compute the standard deviation.

Below is sample code in C# where I try to compute the standard deviation. It is easy to see, even without knowing the statistical formula, that the output of the program should be zero. (If the data is an array of one constant, then the average of the squares equals the square of the average.)

static double standardDeviation(double[] data)
{
    double sum = 0;
    double sumOfSquares = 0;
    foreach (double number in data)
    {
        sum += number;
        sumOfSquares += number * number;
    }
    double average = sum / data.Length;
    double averageOfSquares = sumOfSquares / data.Length;
    return Math.Sqrt(averageOfSquares - average * average);
}
static void Main(string[] args)
{
    double bigNumber = 1478340000000;
    double[] data = Enumerable.Repeat(bigNumber, 83283).ToArray();
    Console.WriteLine(standardDeviation(data));
}

Instead of zero, the program outputs a huge number caused by numerical errors: 2133383.0308878

Note that if I omitted Math.Sqrt (i.e. computed the variance instead of the standard deviation), the error would be much larger.

What is the cause, and how do I write this with smaller numerical errors?


Solution

  • Although the formula you use for the variance is correct mathematically -- i.e. if you have infinite precision -- it can lead to trouble with finite precision.

    A better way for N data points X is to compute

    variance = Sum{ square( X[i] - mean) }/ N
    

    where

    mean = Sum{ X[i] } /N
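
    For illustration, a two-pass version of the question's method might look like the sketch below (the name twoPassStandardDeviation is just illustrative, not an established API):

    static double twoPassStandardDeviation(double[] data)
    {
        // First pass: compute the mean.
        double sum = 0;
        foreach (double number in data)
            sum += number;
        double mean = sum / data.Length;

        // Second pass: average the squared deviations from the mean.
        double sumOfSquaredDeviations = 0;
        foreach (double number in data)
        {
            double deviation = number - mean;
            sumOfSquaredDeviations += deviation * deviation;
        }
        return Math.Sqrt(sumOfSquaredDeviations / data.Length);
    }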
    

    As written this requires two passes through the data. If that is awkward you can in fact do it in a single pass. You need to keep three variables: n (the number of data items seen so far), mean and variance. These should all be initialised to 0 (0.0 for mean and variance). Then when you get the next data item x:

    n = n + 1
    f = 1.0/n
    d = x-mean
    mean = mean + f*d
    variance = (1.0-f)*(variance + f*d*d)
    

    At each stage, after processing a data item, n, mean and variance are indeed the count, mean and variance of the data seen so far. A C# sketch of this update is given below.
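
    In C#, a minimal sketch of this single-pass update could look like the following (onlineStandardDeviation is just an illustrative name):

    static double onlineStandardDeviation(double[] data)
    {
        int n = 0;
        double mean = 0.0;
        double variance = 0.0;
        foreach (double x in data)
        {
            n = n + 1;
            double f = 1.0 / n;
            double d = x - mean;
            mean = mean + f * d;                           // running mean of the first n items
            variance = (1.0 - f) * (variance + f * d * d); // running (population) variance
        }
        return Math.Sqrt(variance);
    }

    Called with the constant array from the question, this should return 0 (rather than the roughly 2.1 million produced by the original formula), because it never subtracts two nearly equal huge numbers.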