Tags: python, pandas, outliers, dbscan

Detect outliers or noise data in each group in Python


I'm working with a dataset that has 3 columns: type, x, and y. Let's say x and y are correlated and not normally distributed. I want to group by type and filter out outliers or noisy data points in x and y. Could someone recommend statistical or machine learning methods for filtering outliers or noise? How can I do that in Python?

I'm considering using DBSCAN from scikit-learn. Is it an appropriate method?

(Scatter plots of x vs. y for Type 1, Type 2, and Type 3 omitted.)

# Look at one type at a time
df1 = df.loc[df['type'] == '3']

data = df1[["x", "y"]]
data.plot.scatter(x = "x", y = "y")

from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
  eps = 0.5,
  metric="euclidean",
  min_samples = 3,
  n_jobs = -1)
# Points that do not belong to any cluster get the label -1 (noise)
clusters = outlier_detection.fit_predict(data)

from matplotlib import cm
cmap = cm.get_cmap('Accent')
# Color the points by their DBSCAN cluster label (-1 = noise)
data.plot.scatter(
  x = "x",
  y = "y",
  c = clusters,
  cmap = cmap,
  colorbar = False
)

(Resulting scatter plot, colored by DBSCAN cluster label, omitted.)
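For reference, DBSCAN labels noise points as -1, so a group-wise version of what I'm trying to do might look like the sketch below. The eps and min_samples values are placeholders and would probably need tuning per group, ideally after scaling x and y:

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def drop_dbscan_noise(group, eps=0.5, min_samples=3):
    # Scale x and y so that eps means roughly the same thing in every group
    X = StandardScaler().fit_transform(group[["x", "y"]])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # DBSCAN marks noise points with the label -1; keep everything else
    return group[labels != -1]

# eps and min_samples are placeholders and need tuning for the real data
df_clean = df.groupby("type", group_keys=False).apply(drop_dbscan_noise)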


Solution

  • For this type of data and these outliers I would recommend a statistical approach. The SPE/DmodX (distance to model) or Hotelling's T2 test may help you here. I do not see the data for the 3 types, but I generated some.

    These methods are available in the pca library. With the n_std parameter you can adjust the width of the ellipse.

    pip install pca
    
    import pca
    results = pca.spe_dmodx(X, n_std=3, showfig=True)
    
    # If you want to use the Hotelling T2 test instead:
    # results1 = pca.hotellingsT2(X, alpha=0.001)
    

    (Example results figure omitted.)

    results is a dictionary that contains the labels of the outlier points.
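
    If you prefer to stay with numpy/scipy, the same elliptic-cutoff idea behind Hotelling's T2 can be sketched per group directly. This is a from-scratch illustration, not the pca library's API, and the chi-square cutoff assumes each group is roughly elliptical:

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2

    def mahalanobis_outliers(group, alpha=0.001):
        # Squared Mahalanobis distance of each (x, y) point to the group mean
        X = group[["x", "y"]].to_numpy(dtype=float)
        diff = X - X.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        # Under rough bivariate normality d2 follows a chi-square with 2 dof;
        # points beyond the 1 - alpha quantile are flagged as outliers
        return pd.Series(d2 > chi2.ppf(1 - alpha, df=2), index=group.index)

    # Boolean outlier mask computed per type; alpha is a tunable assumption
    is_outlier = df.groupby("type", group_keys=False).apply(mahalanobis_outliers)
    df_clean = df[~is_outlier]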