machine-learning cluster-analysis outliers anomaly-detection hdbscan

Anomaly Detection with DBSCAN


I am using DBSCAN on my training dataset to find outliers and remove them before training a model. The training set has 7697 rows and 8 columns. Here is my code:

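from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Standardize the feature columns, then fit DBSCAN on the scaled values
X = StandardScaler().fit_transform(X_train[all_features])
model = DBSCAN(eps=0.3, min_samples=10).fit(X)
print(model)

# Drop the rows DBSCAN labeled as noise (-1) and rebuild the index
X_train_1 = X_train.drop(X_train[model.labels_ == -1].index).copy()
X_train_1.reset_index(drop=True, inplace=True)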

Q-1: Out of these 7, some are discrete and some are continuous. Is it OK to scale both the discrete and the continuous ones, or only the continuous ones?

Q-2: Do I need to map the clusters learned from the training data onto the test data?


Solution

  • DBSCAN will handle those outliers for you; that's what it was built for. Points that do not belong to any cluster get the label -1 (noise). See the example below and post back if you have additional questions.

    import seaborn as sns
    import pandas as pd

    # Load the Titanic data and drop rows with missing values
    titanic = sns.load_dataset('titanic').dropna()
    titanic['age'].plot.hist(
      bins = 50,
      title = "Histogram of the age variable"
    )

    # Univariate check: flag ages at least 2.5 standard deviations from the mean
    from scipy.stats import zscore
    titanic["age_zscore"] = zscore(titanic["age"])
    titanic["is_outlier"] = titanic["age_zscore"].abs() >= 2.5
    titanic[titanic["is_outlier"]]

    # Bivariate view: age against fare, before scaling
    ageAndFare = titanic[["age", "fare"]]
    ageAndFare.plot.scatter(x = "age", y = "fare")

    # Rescale both columns to [0, 1] so neither dominates the distance
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    ageAndFare = scaler.fit_transform(ageAndFare)
    ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
    ageAndFare.plot.scatter(x = "age", y = "fare")

    # Fit DBSCAN; eps is measured in the scaled [0, 1] units,
    # and points labeled -1 are the outliers
    from sklearn.cluster import DBSCAN
    outlier_detection = DBSCAN(
      eps = 0.5,
      metric="euclidean",
      min_samples = 3,
      n_jobs = -1)
    clusters = outlier_detection.fit_predict(ageAndFare)
    clusters

    # Color each point by its cluster label
    # (a colormap name is passed because cm.get_cmap is gone from recent Matplotlib)
    ageAndFare.plot.scatter(
      x = "age",
      y = "fare",
      c = clusters,
      cmap = "Accent",
      colorbar = False
    )
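To connect this back to the original goal (removing outliers before training), here is a minimal sketch of turning the DBSCAN labels into a row filter. The data is synthetic and the names `f1`, `f2`, `X_train_clean` and the `eps` / `min_samples` values are assumptions for illustration, not tuned settings for your dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical training frame: one dense blob plus three far-away points
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[15.0, 15.0], [-15.0, 12.0], [14.0, -16.0]])
X_train = pd.DataFrame(np.vstack([inliers, outliers]), columns=["f1", "f2"])

# Scale, then cluster; eps and min_samples are illustrative values only
X_scaled = StandardScaler().fit_transform(X_train)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# DBSCAN marks noise with the label -1; keep everything else
is_noise = labels == -1
X_train_clean = X_train.loc[~is_noise].reset_index(drop=True)

print(f"flagged {is_noise.sum()} of {len(X_train)} rows as noise")
```

Filtering with a boolean mask avoids the `drop(...index)` round trip and does not depend on the index of `X_train` being unique.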
    

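On Q-2: scikit-learn's `DBSCAN` has `fit` and `fit_predict` but no `predict`, so there is no built-in way to map the clusters onto test data. If you really need labels for new rows, one common workaround is to give each new point the label of its nearest core sample when that core sample lies within `eps`, and -1 otherwise. The sketch below assumes that approach; `assign_to_clusters` is a hypothetical helper, not part of scikit-learn, and the data is synthetic. In practice the new rows would also need to go through the scaler that was fitted on the training data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def assign_to_clusters(fitted, X_new):
    """Hypothetical helper: label each new point with the cluster of its
    nearest core sample, or -1 if no core sample lies within eps."""
    X_new = np.asarray(X_new)
    if len(fitted.components_) == 0:
        # No core samples were found, so everything counts as noise
        return np.full(len(X_new), -1)
    # Euclidean nearest-neighbor search, matching DBSCAN's default metric
    nn = NearestNeighbors(n_neighbors=1).fit(fitted.components_)
    dist, idx = nn.kneighbors(X_new)
    # Cluster labels of the core samples, in the same order as components_
    core_labels = fitted.labels_[fitted.core_sample_indices_]
    labels = core_labels[idx[:, 0]]
    labels[dist[:, 0] > fitted.eps] = -1
    return labels


rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 2)) * 0.3           # synthetic "training" data
model = DBSCAN(eps=0.3, min_samples=10).fit(X_fit)

X_new = np.array([[0.0, 0.0], [10.0, 10.0]])      # one central, one distant point
new_labels = assign_to_clusters(model, X_new)
print(new_labels)
```

If the only purpose is to clean the training set, this step can be skipped entirely: the outlier removal happens once on the training rows, and the test rows are left as they are.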