
matplotlib: disregard outliers when plotting


I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.

With matplotlib, I plot against the range [0, max_data_value].

How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?

Should I simply take the 95th percentile and use the range [0, 95th_percentile] on the x-axis?


Solution

  • There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").

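    Most robust tests for outliers use the median absolute deviation rather than the standard deviation or another variance-based measure, because the variance/stddev is itself heavily skewed by the outliers. A plain percentile cutoff such as the 95th percentile has a different problem: it always discards a fixed fraction of the points, whether or not they are actually outliers.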
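
    A quick way to see the difference is the minimal sketch below. The sample values and the `mad` helper are made up for illustration (they are not part of the function that follows); the data simply mirrors the question's case of one value near 0.1 among values three orders of magnitude smaller.

```python
import numpy as np

# Hypothetical sample: nine small values plus one outlier that is
# three orders of magnitude larger.
data = np.array([1.0e-4, 1.2e-4, 0.9e-4, 1.1e-4, 1.0e-4,
                 0.8e-4, 1.3e-4, 1.0e-4, 1.1e-4, 0.1])
clean = data[:-1]  # the same sample without the outlier


def mad(values):
    """Median absolute deviation from the median (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))


# The standard deviation changes by orders of magnitude once the
# outlier is included; the MAD stays at about the same value.
print("std with outlier:   ", data.std())
print("std without outlier:", clean.std())
print("MAD with outlier:   ", mad(data))
print("MAD without outlier:", mad(clean))
```

    Because the spread estimate barely moves when the outlier is present, a threshold expressed in units of the MAD still describes the bulk of the data.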

    Here's a function that implements one of the more common outlier tests.

    import numpy as np

    def is_outlier(points, thresh=3.5):
        """
        Returns a boolean array with True if points are outliers and False 
        otherwise.
    
        Parameters:
        -----------
            points : A numobservations by numdimensions array of observations
            thresh : The modified z-score to use as a threshold. Observations with
                a modified z-score (based on the median absolute deviation) greater
                than this value will be classified as outliers.
    
        Returns:
        --------
            mask : A numobservations-length boolean array.
    
        References:
        ----------
            Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
            Handle Outliers", The ASQC Basic References in Quality Control:
            Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
        """
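        # Accept lists and other sequences as well as arrays
        points = np.asarray(points, dtype=float)
        # Treat 1-D input as a column of one-dimensional observations
        if points.ndim == 1:
            points = points[:, None]
        median = np.median(points, axis=0)
        # Euclidean distance of each observation from the median
        diff = np.sum((points - median)**2, axis=-1)
        diff = np.sqrt(diff)
        med_abs_deviation = np.median(diff)
    
        # 0.6745 scales the MAD so the score is comparable to a z-score for
        # normally distributed data. If the MAD is 0 (more than half of the
        # points equal the median), this division is undefined.
        modified_z_score = 0.6745 * diff / med_abs_deviation
    
        return modified_z_score > thresh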
    

    As an example of using it, you'd do something like the following:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # The function above... In my case it's in a local utilities module
    from sci_utilities import is_outlier
    
    # Generate some data
    x = np.random.random(100)
    
    # Append a few "bad" points
    x = np.r_[x, -3, -10, 100]
    
    # Keep only the "good" points
    # "~" operates as a logical not operator on boolean numpy arrays
    filtered = x[~is_outlier(x)]
    
    # Plot the results
    fig, (ax1, ax2) = plt.subplots(nrows=2)
    
    ax1.hist(x)
    ax1.set_title('Original')
    
    ax2.hist(filtered)
    ax2.set_title('Without Outliers')
    
    plt.show()
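
    If the goal is to zoom in rather than drop points, the same mask can be used to choose the axis limits while every point stays in the plot. The following is only a sketch under some assumptions: `robust_limits` and `plot_zoomed` are hypothetical helper names, the `pad` margin is an arbitrary choice, and the compact 1-D `is_outlier` is repeated here only so the example runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt


def is_outlier(points, thresh=3.5):
    """Compact 1-D version of the modified z-score test."""
    points = np.asarray(points, dtype=float)
    diff = np.abs(points - np.median(points))
    med_abs_deviation = np.median(diff)
    if med_abs_deviation == 0:
        # No spread around the median: treat nothing as an outlier.
        return np.zeros(points.shape, dtype=bool)
    return 0.6745 * diff / med_abs_deviation > thresh


def robust_limits(values, thresh=3.5, pad=0.05):
    """Hypothetical helper: limits spanning only the non-outlier values."""
    values = np.asarray(values, dtype=float)
    good = values[~is_outlier(values, thresh)]
    lo, hi = good.min(), good.max()
    margin = pad * (hi - lo)  # small margin so edge points are not clipped
    return lo - margin, hi + margin


def plot_zoomed(x, y):
    """Plot every point, but restrict the x-axis to the robust range."""
    fig, ax = plt.subplots()
    ax.plot(x, y, "o")
    ax.set_xlim(*robust_limits(x))
    return fig, ax


# Same kind of test data as above: 100 uniform values plus three "bad" points
rng = np.random.default_rng(0)
x = np.r_[rng.random(100), -3, -10, 100]
y = rng.random(x.size)

fig, ax = plot_zoomed(x, y)
# plt.show()  # uncomment when running interactively
```

    The outliers are still part of the plotted data; they just fall outside the visible x-range, so the original array is left untouched.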
    
