
How do I determine the distance / eps for DBSCAN in R?


I have a dataset of points:

 lat    long     time
 34.53  -126.34  1
 34.52  -126.32  2
 34.51  -126.31  3
 34.54  -126.36  4
 34.59  -126.28  5
 34.63  -126.14  6
 34.70  -126.05  7
 ...

(Much larger dataset, but this is the general structure.)

I want to cluster points based on distance and time. DBSCAN seems like a good choice, since I don't know how many clusters there are.

I am currently using minute/5500 as eps (which I believe is approximately 20 meters, scaled).

library(fpc)
results <- dbscan(data, MinPts = 3, eps = 0.00045, method = "raw", scale = FALSE, showplot = 1)

I am having a problem understanding how the scaling / distance is determined, since I have raw data. I can guess at values for eps when scaled or unscaled, but I am unclear what the scaling does, or what distance metric is being used (Euclidean distance, perhaps?) Is there documentation on this somewhere?

(This is not about finding an automated way to choose eps, as in "Choosing eps and minpts for DBSCAN (R)?", but about what the different values mean. Saying "You need a distance function first" doesn't explain what distance function is being used, or how to create one.)


Solution

  • First calculate the distance matrix of your data. Then, instead of using method='raw', use method='dist'. dbscan will then treat your data as a distance matrix, so you don't need to worry about how the distance function is implemented and eps is expressed in whatever units your distances use. Note that this might require more memory, since you're precomputing the distance matrix and storing it in memory.
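
A minimal sketch of that approach, under some assumptions: the great-circle (haversine) distance is my choice of metric here, not something the question or fpc prescribes; `haversine_matrix` is a hypothetical helper written for this example; and the eps of 2500 meters is purely illustrative. The time column is ignored in this sketch, and the dbscan call only runs if fpc is installed.

```r
# Hypothetical helper: pairwise haversine distances in meters (base R only).
# Assumes a spherical Earth with a radius of 6,371 km.
haversine_matrix <- function(lat, long, radius = 6371000) {
  phi    <- lat  * pi / 180                 # latitudes in radians
  lambda <- long * pi / 180                 # longitudes in radians
  dphi    <- outer(phi, phi, "-")           # pairwise latitude differences
  dlambda <- outer(lambda, lambda, "-")     # pairwise longitude differences
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlambda / 2)^2
  a[a > 1] <- 1                             # guard against rounding above 1
  2 * radius * asin(sqrt(a))
}

# The sample rows shown in the question
data <- data.frame(
  lat  = c(34.53, 34.52, 34.51, 34.54, 34.59, 34.63, 34.70),
  long = c(-126.34, -126.32, -126.31, -126.36, -126.28, -126.14, -126.05),
  time = 1:7
)

# Symmetric 7 x 7 matrix of distances in meters
d <- haversine_matrix(data$lat, data$long)

if (requireNamespace("fpc", quietly = TRUE)) {
  # With method = "dist", eps is in the units of d (meters here).
  # eps = 2500 is an illustrative value, not a recommendation.
  results <- fpc::dbscan(d, eps = 2500, MinPts = 3, method = "dist")
  print(results$cluster)
}
```

Because the matrix holds every pairwise distance, memory grows with the square of the number of points, which is the cost the answer warns about.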