
How to understand the trimmed mean in Scipy


I can't explain the behaviour of trim_mean() in scipy.stats.

I learned that the trimmed mean calculates the average of a series of numbers after discarding given parts of the probability distribution.

In the following example, I get the result 6.1111:

from scipy.stats import trim_mean

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
trim_percentage = 0.05  # Trim 5% from each end

result = trim_mean(sorted(data), trim_percentage)
print(f"result = {result}")

result = 6.111111111111111

However, I expected 1 and 30 to be cut out, because they fall below the 5th percentile and above the 95th percentile.

When I do it manually:

import numpy as np

data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
p5, p95 = np.percentile(data, [5, 95])
print(f"The 5th percentile = {p5}\nThe 95th percentile = {p95}")

trim_average = np.mean(list(filter(lambda x: p5 < x < p95, data)))
print(f"trimmed average = {trim_average}")

I got this:

The 5th percentile = 1.4

The 95th percentile = 19.999999999999993

trimmed average = 3.4285714285714284

Does this mean that trim_mean() treats each number separately and assumes a uniform distribution? The proportiontocut parameter is described as the "Fraction to cut off of both tails of the distribution". So why does it behave as if the distribution were not considered?


Solution

  • The phrasing in the documentation should be more precise: it cuts a fraction of the observations in your sample. You have 9 values, and 5% of 9 values is 0.45 values. However, it can't cut off a fraction of a value. The documentation states that it

    Slices off less if proportion results in a non-integer slice index

    So in your case, zero values are cut from both ends before taking the mean.

    import numpy as np
    from scipy import stats
    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    np.mean(x)  # 6.111111111111111
    stats.trim_mean(x, 0.05)  # 6.111111111111111
    
    
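    For intuition, the number of values removed from each tail works out to the integer part of proportiontocut * len(x); the fractional part is simply dropped. Here is a rough sketch of that arithmetic (a simplified model of the documented behaviour, not SciPy's actual source):

    def trim_count(n, proportiontocut):
        # Values removed from *each* tail; the fractional part is discarded.
        return int(proportiontocut * n)

    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    trim_count(len(x), 0.05)  # 0 -> nothing trimmed, so the plain mean is returned
    trim_count(len(x), 0.12)  # 1 -> the smallest and largest values are trimmed
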

    You can verify that the result changes when proportiontocut exceeds 1/len(x):

    from scipy import stats
    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    p = 1 / len(x)
    eps = 1e-15
    stats.trim_mean(x, p-eps)  # 6.111111111111111
    stats.trim_mean(x, p+eps)  # 3.4285714285714284
    
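    As a sanity check (assuming exactly one value is dropped from each end once proportiontocut exceeds 1/len(x)), the second result matches trimming the smallest and largest values by hand:

    import numpy as np
    x = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    np.mean(sorted(x)[1:-1])  # 3.4285714285714284, same as trim_mean(x, p + eps)
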

    This behavior appears to be consistent with the description of a trimmed mean on Wikipedia, at least:

    This number of points to be discarded is usually given as a percentage of the total number of points, but may also be given as a fixed number of points... For example, given a set of 8 points, trimming by 12.5% would discard the minimum and maximum value in the sample: the smallest and largest values, and would compute the mean of the remaining 6 points.
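
    The 8-point example from that description is easy to reproduce with trim_mean; the numbers below are made up for illustration and are not from the question:

    from scipy import stats
    y = [1, 2, 3, 4, 5, 6, 7, 100]
    # 12.5% of 8 points is exactly 1, so the minimum and maximum are discarded.
    stats.trim_mean(y, 0.125)  # 4.5, the mean of [2, 3, 4, 5, 6, 7]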

    SciPy does not have a function that trims based on percentiles (of which there are many conventions). For that, you'd need to write your own function, or perhaps there is such a function in another library.
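
    If you do want percentile-based trimming, a minimal sketch of such a function could look like the following (a hypothetical helper relying on np.percentile's default interpolation; other percentile conventions will give slightly different cutoffs):

    import numpy as np

    def percentile_trimmed_mean(data, lower=5, upper=95):
        # Hypothetical helper: mean of the values strictly between the given percentiles.
        lo, hi = np.percentile(data, [lower, upper])
        return np.mean([x for x in data if lo < x < hi])

    data = [1, 2, 2, 3, 4, 30, 4, 4, 5]
    percentile_trimmed_mean(data)  # 3.4285714285714284, matching the manual computation above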

    Please consider opening an issue about improving the documentation.