python machine-learning cluster-analysis

DBSCAN eps and min_samples

I have been trying to use DBSCAN in order to detect outliers, from my understanding DBSCAN outputs -1 as outlier and 1 as inliner, but after I ran the code, I'm getting numbers that are not -1 or 1, can someone please explain why? Also is it normal to find the best value of eps using trial and error, because I couldn't figure out a way to find the best possible eps value.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.cluster import DBSCAN



df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)


# Dropping columns with low feature importance
del df['AmbTemp_DegC']
del df['NacelleOrientation_Deg']
del df['MeasuredYawError']



#applying DBSCAN


DBSCAN = DBSCAN(eps = 1.8, min_samples =10,n_jobs=-1)

df['anomaly'] = DBSCAN.fit_predict(df)


np.unique(df['anomaly'],return_counts=True)

(array([  -1,    0,    1, ..., 8462, 8463, 8464]),
array([1737565, 3539278, 4455734, ...,      13,       8,       8]))

Thank you.

Solution

Well, you did not actually get the real idea of DBSCAN.

This is a copy from wikipedia:

A point p is a core point if at least minPts points are within distance ε of it (including p).

A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.

A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. Note that this implies that all points on the path must be core points, with the possible exception of q.

All points not reachable from any other point are outliers or noise points.

So saying in easier words, The idea is that:

Any sample who has min_samples neigbours by the distance of epsilon is a core sample.
Any data sample which is not core, but has at least one core neighbor (with a distance less than eps), is a directly reachable sample and can be added to the cluster.
Any data sample which is not directly reachable nor a core, but has at least one directly reachable neighbor (with a distance less than eps) is a reachable sample and will be added to the cluster.
Any other examples are considered to be noise, outlier or whatever you want to name it.( and those will be labeled by -1)

Depending on the parameters of the clustering (eps and min_samples) , you are very likely to have more than two clusters. You see, that is the reason you are seeing other values than 0 and -1 in the result of your clustering.

To answer your second question

Also is it normal to find the best value of eps using trial and error,

If you mean doing cross-validation( over a set where you know the cluster labels or you can approximate the correct clustering), yes I think that is the normal way to do it

PS: The paper is very good and comprehensive. I highly suggest you have a look. Good luck.