Tags: python, dataframe, scipy, kernel-density, anomaly-detection

Get sparse region of KDE


I have an array of 20k real numbers, and I use pd.DataFrame(scores).plot.kde(figsize=(24,8)) to get the below kernel density estimation. How can I purely programmatically select the indexes of the sparse regions, or conversely the dense region?

My current approach is of the form np.where(scores > np.percentile(scores, 99))[0], but I am wary of hard-coding 99, as it may not work well in production. A potential solution, which I'm not sure how to approach, is selecting the indices where the density is below 20,000.

image


Solution

  • Which region to consider "sparse" and which "dense" can be very subjective. It also depends heavily on what the data means. One idea is to decide on cut-off percentiles. The example below marks the 0.1th and 99.9th percentiles.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    
    # synthetic data: 10 random walks of 2000 steps, flattened to 20k scores
    df = pd.DataFrame({'score': np.random.randn(2000, 10).cumsum(axis=0).ravel()})
    # cut-off values at the 0.1th and 99.9th percentiles
    lo, hi = df['score'].quantile([.001, .999])
    ax = df.plot.kde(figsize=(24, 8))
    ax.axvline(lo, color='crimson', ls=':')
    ax.axvline(hi, color='crimson', ls=':')
    ax.set_ylim(ymin=0) # avoid the kde "floating in the air"
    plt.show()
    

    example plot
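
    The plot only marks the cut-offs; to actually get the indices asked for in the question, something along the following lines might work. This is a minimal sketch, not a definitive recipe. Option 1 reuses the percentile cut-offs. Option 2 thresholds the estimated density directly, using scipy.stats.gaussian_kde (which, as far as I know, is what pandas' plot.kde relies on). The names sparse_idx, dense_idx, low_density_idx and rel_threshold, the grid size of 512 and the 5 % threshold are all assumptions made for illustration and would need tuning for real data.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the 20k scores (same construction as the example above).
rng = np.random.default_rng(0)
scores = rng.standard_normal((2000, 10)).cumsum(axis=0).ravel()

# Option 1: percentile cut-offs, matching the dotted lines in the plot.
lo, hi = np.percentile(scores, [0.1, 99.9])
sparse_idx = np.where((scores < lo) | (scores > hi))[0]
dense_idx = np.where((scores >= lo) & (scores <= hi))[0]

# Option 2: threshold the estimated density itself.
# Evaluating the KDE at all 20k points is slow, so evaluate it on a grid
# and interpolate the density back onto the individual scores.
kde = gaussian_kde(scores)
grid = np.linspace(scores.min(), scores.max(), 512)  # grid size is an arbitrary choice
density = np.interp(scores, grid, kde(grid))

rel_threshold = 0.05  # hypothetical: "sparse" means below 5 % of the peak density
low_density_idx = np.where(density < rel_threshold * density.max())[0]

print(len(sparse_idx), len(dense_idx), len(low_density_idx))
```

    A threshold relative to the peak density avoids hard-coding an absolute density value, but it is still a judgment call, just like the percentile. Unlike the percentile approach, it can also flag sparse gaps between modes, not only the two tails.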