My actual problem is more technical, but I will simplify it with an example of counting balls.
Assume I have balls of different colors and one index of an array (initialized to all 0's) reserved for each color. Every time I pick a ball, I increment the corresponding index by 1.
Balls are picked randomly and I can only pick one ball at a time. My sole purpose is to count the number of balls of each color until I run out of balls.
I would like to calculate the standard deviation of the number of balls of each color while I am counting them. I do not want to have to iterate through the array once more after I am done counting all the balls.
To visualize:
Balls in random order: BBGRRYYBBGGGGGGB
(each letter represents the first letter of a color)
Array indices from 0 to 3 correspond to colors B, G, R and Y respectively.
When I am done picking the balls, my array looks like [5, 7, 2, 2].
It is very simple to calculate the standard deviation once I have the final array, but I want to do it while I am filling the array.
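(For the example above, using the population form of the formula, the mean is (5 + 7 + 2 + 2) / 4 = 4 and the standard deviation is sqrt(((5-4)^2 + (7-4)^2 + (2-4)^2 + (2-4)^2) / 4) = sqrt(4.5) ≈ 2.12.)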
I want to do this in Java, and I have approximately 1000 colors.
What is the most efficient way to implement this? Is there even a way to do it before the final array is complete?
Since the average and standard deviation are calculated from sums, you can maintain appropriate accumulators for them as you go. Then, when you want the actual values, finish the rest of the calculation (in particular, the divisions and the square root).
The sum of squares is the tricky part, since you increment one of the frequencies for each input. One way to deal with this is to maintain a count of each color seen so far (using an appropriate data structure). Then, when you see a color in the input, subtract its previous square from the accumulator and add the new square back in (or, equivalently, add the difference of the two squares: incrementing a count from c to c + 1 increases the sum of squares by 2c + 1).
I'll leave it to the reader to implement the algorithm described here.
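For illustration only, here is a minimal sketch of what such accumulators could look like in Java. The class and method names are hypothetical, colors are assumed to be mapped to indices 0..k-1 as in the question, and the population standard deviation is used:

```java
public class RunningColorStats {
    private final long[] counts;   // counts[i] = balls of color i seen so far
    private long total;            // total balls seen (sum of all counts)
    private long sumOfSquares;     // sum of counts[i]^2 over all colors

    public RunningColorStats(int numColors) {
        this.counts = new long[numColors];
    }

    /** Record one ball of the given color index in O(1). */
    public void pick(int colorIndex) {
        long old = counts[colorIndex]++;
        total++;
        // (old + 1)^2 - old^2 == 2*old + 1
        sumOfSquares += 2 * old + 1;
    }

    /** Mean number of balls per color. */
    public double mean() {
        return (double) total / counts.length;
    }

    /** Population standard deviation of the per-color counts. */
    public double standardDeviation() {
        double m = mean();
        double variance = (double) sumOfSquares / counts.length - m * m;
        return Math.sqrt(Math.max(0.0, variance)); // guard against tiny negative rounding error
    }

    public static void main(String[] args) {
        // The example from the question: BBGRRYYBBGGGGGGB with B=0, G=1, R=2, Y=3
        int[] balls = {0, 0, 1, 2, 2, 3, 3, 0, 0, 1, 1, 1, 1, 1, 1, 0};
        RunningColorStats stats = new RunningColorStats(4);
        for (int b : balls) {
            stats.pick(b);
        }
        // Final counts are [5, 7, 2, 2]; prints approximately 2.1213
        System.out.println(stats.standardDeviation());
    }
}
```

With about 1000 colors, the standard deviation is available at any point in O(1) time, while each pick still costs only a couple of extra arithmetic operations on top of the array increment.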