Calculate mean and median efficiently


What is the most efficient way to compute the running (cumulative) mean and median of a Python list, i.e. the mean and median of each successively longer prefix of the list?

For example, my list:

input_list = [1,2,4,6,7,8]

I want to produce an output list that contains:

output_list_mean = [1,1.5,2.3,3.25,4,4.7]
output_list_median = [1,1.5,2.0,3.0,4.0,5.0]

Where the mean is calculated as follows:

  • 1 = mean(1)
  • 1.5 = mean(1,2) (i.e. mean of first 2 values in input_list)
  • 2.3 = mean(1,2,4) (i.e. mean of first 3 values in input_list)
  • 3.25 = mean(1,2,4,6) (i.e. mean of first 4 values in input_list) etc.

And the median is calculated as follows:

  • 1 = median(1)
  • 1.5 = median(1,2) (i.e. median of first 2 values in input_list)
  • 2.0 = median(1,2,4) (i.e. median of first 3 values in input_list)
  • 3.0 = median(1,2,4,6) (i.e. median of first 4 values in input_list) etc.

I have tried to implement it with the following loop, but it seems very inefficient, since it recomputes each statistic from scratch for every prefix.

import numpy

input_list = [1,2,4,6,7,8]

for item in range(1,len(input_list)+1):
    print(numpy.mean(input_list[:item]))
    print(numpy.median(input_list[:item]))

Solution

  • Anything you write yourself, especially for the median, will either take a lot of work or be very inefficient. Pandas ships efficient built-in implementations of both functions you are after: the expanding mean is O(n), and the expanding median is O(n*log(n)), using a skip list:

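    import pandas as pd
    
    input_list = [1, 2, 4, 6, 7, 8]
    
    # pd.expanding_mean / pd.expanding_median were removed in later pandas
    # versions; the Series.expanding() window object replaces them.
    >>> pd.Series(input_list).expanding().mean().round(5).tolist()
    [1.0, 1.5, 2.33333, 3.25, 4.0, 4.66667]
    
    >>> pd.Series(input_list).expanding().median().tolist()
    [1.0, 1.5, 2.0, 3.0, 4.0, 5.0]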