Tags: python-2.7, cluster-analysis, hierarchical-clustering, outliers, dbscan

Given a dataset with normal values and outliers, is there a standard way to find a normalised value of epsilon for DBSCAN?


I am working on my own implementation of DBSCAN, but I have trouble finding epsilon dynamically for each data set I need to handle. The value of epsilon I compute before running DBSCAN is based on all of the data, outliers included, so the resulting epsilon is skewed by the outliers, which is a problem for me. Is there any way to counter this?

This is the part of the code which calculates the epsilon for the specific dataset:

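import math
import numpy as np

# X is assumed to be an array of 2-D points, shape (n_samples, 2)
xmax = np.max(X, axis=0)
xmin = np.min(X, axis=0)
min_max = xmax - xmin  # extent of the bounding box along each axis
k = 10
eps = (min_max[0] * min_max[1] * k / (len(X) * math.pi)) ** 0.5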

I use functions such as max and min from the numpy module.
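
Read as a formula, the computation above seems to pick eps so that a circle of radius eps would hold about k points if the n points were spread uniformly over their bounding box. This is an interpretation of the code, not something stated with it; A and n are names introduced here for the bounding-box area and the number of points:

```latex
A = (x_{\max} - x_{\min})(y_{\max} - y_{\min}), \qquad
\pi \varepsilon^{2} \cdot \frac{n}{A} = k
\quad\Longrightarrow\quad
\varepsilon = \sqrt{\frac{A\,k}{n\,\pi}}
```

Because A is built from the minimum and maximum of each coordinate, a single distant outlier enlarges A and therefore eps.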


Solution

  • If finding an appropriate value of epsilon is a major problem, the real problem may lie much earlier: you may be using the wrong distance measure altogether, or you may have a preprocessing problem.

    Your code looks a lot like a naive preprocessing approach, and it will only work as well as such an approach can.

    Also read the DBSCAN paper. The authors propose a way of choosing epsilon in section 4.2, and you may be able to automate this...
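
One possible way to automate that suggestion is sketched below, assuming the heuristic is based on the sorted k-dist graph (each point's distance to its k-th nearest neighbour, sorted in descending order). The function names, the default k=4, and the noise_fraction parameter are illustrative assumptions, not taken from the question or from any library; the idea is to skip the points with the largest k-distances, which are presumed to be noise, and use the next value as epsilon.

```python
import numpy as np

def sorted_k_distances(X, k=4):
    """Distance from each point to its k-th nearest neighbour, sorted descending.

    Hypothetical helper for illustration; requires more than k points.
    """
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix: O(n^2) memory, fine for small data.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbour.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)[::-1]

def eps_from_noise_fraction(X, k=4, noise_fraction=0.05):
    """Propose eps by skipping the assumed share of noise points.

    noise_fraction is a user-supplied guess (an assumption of this sketch).
    """
    kdist = sorted_k_distances(X, k)
    idx = min(int(noise_fraction * len(kdist)), len(kdist) - 1)
    return kdist[idx]
```

Unlike the bounding-box formula, this estimate depends on local neighbour distances, so a few far-away outliers only affect the head of the sorted list that is skipped. Plotting the result of sorted_k_distances and looking for the first "valley" by eye is an alternative when no estimate of the noise share is available.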