
How to understand the trimmed mean in Scipy


I can't explain the behaviour of trim_mean() in scipy.stats.

I learned that the trimmed mean calculates the average of a series of numbers after discarding given parts of the probability distribution.

In the following example, I get the result 6.1111:

from scipy.stats import trim_mean

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
trim_percentage = 0.05  # Trim 5% from each end

result = trim_mean(sorted(data), trim_percentage)
print(f"result = {result}")

result = 6.111111111111111

However, I expected 1 and 30 to be cut out, because they fall below the 5th percentile and above the 95th percentile.

When I do it manually:

import numpy as np

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
p5, p95 = np.percentile(data, [5, 95])
print(f"The 5th percentile = {p5}\nThe 95th percentile = {p95}")

trim_average = np.mean(list(filter(lambda x: p5 < x < p95, data)))
print(f"trimmed average = {trim_average}")

I got this:

The 5th percentile = 1.4

The 95th percentile = 19.999999999999993

trimmed average = 3.4285714285714284

Does this mean that trim_mean() treats each number separately and assumes a uniform distribution? The proportiontocut parameter is described as the "Fraction to cut off of both tails of the distribution". So why does it behave as if the distribution were not considered?


Solution

  • The phrasing in the documentation should be more precise: it cuts a fraction of the observations in your sample. You have 9 values, and 5% of 9 values is 0.45 values. However, it can't cut off a fraction of a value. The documentation states that it

    Slices off less if proportion results in a non-integer slice index

    So in your case, zero values are cut from both ends before taking the mean.

    import numpy as np
    from scipy import stats
    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    np.mean(x)  # 6.111111111111111
    stats.trim_mean(x, 0.05)  # 6.111111111111111
    
    
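    For intuition, the number of values removed from each tail works out to the integer part of proportiontocut * len(x); the fractional part is simply dropped. Here is a rough sketch of that arithmetic (a simplified model of the documented behaviour, not SciPy's actual source):

    def trim_count(n, proportiontocut):
        # Values removed from *each* tail; the fractional part is discarded.
        return int(proportiontocut * n)

    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    trim_count(len(x), 0.05)  # 0 -> nothing trimmed, so the plain mean is returned
    trim_count(len(x), 0.12)  # 1 -> the smallest and largest values are trimmed
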

    You can verify that the result changes when proportiontocut exceeds 1/len(x):

    from scipy import stats
    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    p = 1 / len(x)
    eps = 1e-15
    stats.trim_mean(x, p-eps)  # 6.111111111111111
    stats.trim_mean(x, p+eps)  # 3.4285714285714284
    
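    As a sanity check (assuming exactly one value is dropped from each end once proportiontocut exceeds 1/len(x)), the second result matches trimming the smallest and largest values by hand:

    import numpy as np
    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    np.mean(sorted(x)[1:-1])  # 3.4285714285714284, same as trim_mean(x, p + eps)
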

    This behavior appears to be consistent with the description of a trimmed mean on Wikipedia, at least:

    This number of points to be discarded is usually given as a percentage of the total number of points, but may also be given as a fixed number of points... For example, given a set of 8 points, trimming by 12.5% would discard the minimum and maximum value in the sample: the smallest and largest values, and would compute the mean of the remaining 6 points.
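
    The 8-point example from that description is easy to reproduce with trim_mean; the numbers below are made up for illustration and are not from the question:

    from scipy import stats
    y = [1, 2, 3, 4, 5, 6, 7, 100]
    # 12.5% of 8 points is exactly 1, so the minimum and maximum are discarded.
    stats.trim_mean(y, 0.125)  # 4.5, the mean of [2, 3, 4, 5, 6, 7]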

    SciPy does not have a function that trims based on percentiles (of which there are many conventions). For that, you'd need to write your own function, or perhaps there is such a function in another library.
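
    If you do want percentile-based trimming, a minimal sketch of such a function could look like the following (a hypothetical helper relying on np.percentile's default interpolation; other percentile conventions will give slightly different cutoffs):

    import numpy as np

    def percentile_trimmed_mean(data, lower=5, upper=95):
        # Hypothetical helper: mean of the values strictly between the given percentiles.
        lo, hi = np.percentile(data, [lower, upper])
        return np.mean([x for x in data if lo < x < hi])

    data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    percentile_trimmed_mean(data)  # 3.4285714285714284, matching the manual computation above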

    Please consider opening an issue about improving the documentation.