Search code examples
pythonalgorithmnumpyscipysampling

Finding appropriate cut-off values


I try to implement Hampel tanh estimators to normalize highly asymmetric data. In order to do this, I need to perform the following calculation:

Given x - a sorted list of numbers and m - the median of x, I need to find a such that approximately 70% of the values in x fall into the range (m-a; m+a). We know nothing about the distribution of values in x. I write in python using numpy, and the best idea that I had is to write some sort of stochastic iterative search (for example, as was described by Solis and Wets), but I suspect that there is a better approach, either in form of better algorithm or as a ready function. I searched the numpy and scipy documentation, but couldn't find any useful hint.

EDIT

Seth suggested to use scipy.stats.mstats.trimboth, however in my test for a skewed distribution, this suggestion didn't work:

from scipy.stats.mstats import trimboth
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5

#check how many elements fall into the range
sel = (theList > (theMedian - a)) * (theList < (theMedian + a))

print np.sum(sel) / float(len(theList))

The output is 0.79 (~80%, instead of 70)


Solution

  • You need to first symmetrize your distribution by folding all values less than the mean over to the right. Then you can use the standard scipy.stats functions on this one-sided distribution:

    from scipy.stats import scoreatpercentile
    import numpy as np
    
    theList = np.log10(1+np.arange(.1, 100))
    theMedian = np.median(theList)
    
    oneSidedList = theList[:]               # copy original list
    # fold over to the right all values left of the median
    oneSidedList[theList < theMedian] = 2*theMedian - theList[theList < theMedian]
    
    # find the 70th centile of the one-sided distribution
    a = scoreatpercentile(oneSidedList, 70) - theMedian
    
    #check how many elements fall into the range
    sel = (theList > (theMedian - a)) * (theList < (theMedian + a))
    
    print np.sum(sel) / float(len(theList))
    

    This gives the result of 0.7 as required.