Search code examples
pythonmeanupdatesvarianceonline-algorithm

Efficient algorithm for online Variance over image batches


I have a large amount of images and want to calculate the variance (of each channel) across all of them. Memory wise I cannot create a large array/tensor to calculate the variance of all color channels at the same time. I therefore need to load the Images in batches and need to update the current variance somehow in an online way after each batch. However, I struggle to find an efficient (and even correct) algorithm for this.

I found the Welford's online algorithm, however it is not efficient as it updates the variance for a single new value, i.e. does not vectorize across multiple values like a single image or a batch of images.

How can I update the variance for multiple new observations in an efficient way, i.e. by using vectorization or making use of inbuilt variance algorithms?


Solution

  • These are the two functions needed to update/combine the mean and variances of two batches. Both functions can be used with vectors (the color channels). The mean and variance can be acquired from inbuilt methods like batch.var().

    Equations taken from: https://notmatthancock.github.io/2017/03/23/simple-batch-stat-updates.html

    # m amount of samples (or pixels) over all previous batches
    # n amount of samples in new incoming batch
    # mu1 previous mean
    # mu2 mean of current batch
    # v1 previous variance
    # v2 variance of current batch
    
    def combine_means(mu1, mu2, m, n):
        """
        Updates old mean mu1 from m samples with mean mu2 of n samples.
        Returns the mean of the m+n samples.
        """
        return (m/(m+n))*mu1 + (n/(m+n))*mu2
    
    def combine_vars(v1, v2, mu1, mu2, m, n):
        """
        Updates old variance v1 from m samples with variance v2 of n samples.
        Returns the variance of the m+n samples.
        """
        return (m/(m+n))*v1 + n/(m+n)*v2 + m*n/(m+n)**2 * (mu1 - mu2)**2
        
    

    As you see one can simplify them a bit by reusing some calculations like m+n but keeping it in this pure form for better understanding.