
r - DBSCAN (Density Based Clustering) describe unit of measure for eps


I am trying to use the dbscan package in R to cluster some spatial data. The dbscan::dbscan function takes eps and minPts as input. I have a data frame with two columns, longitude and latitude, expressed in decimal degrees, as in the following:

df <- data.frame(lon = c(seq(1,5,1), seq(1,5,1)), 
                   lat = c(1.1,3.1,1.2,4.1,2.1,2.2,3.2,2.4,1.4,5.1))

and I apply the algorithm:

 db <- fpc::dbscan(df, eps = 1, MinPts = 2)

Will eps here be defined in degrees or in some other unit? I'm trying to understand in which unit this maximum distance eps is expressed, so any help is appreciated.


Solution

  • Never use the fpc package for this; always use dbscan::dbscan instead.

    If you have latitude and longitude, you need to choose an appropriate distance function such as Haversine.

    The default distance function, Euclidean, ignores the spherical nature of the Earth. The eps value is then a mixture of degrees of latitude and longitude, but these do not correspond to uniform distances! One degree east at the equator (roughly 111 km) is much farther than one degree east in Vancouver (roughly 73 km, at about 49° N).
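
    To make the point about non-uniform degrees concrete, here is a minimal base R sketch of the Haversine formula. The function name haversine_km, the mean Earth radius of 6371 km, and the sample coordinates are assumptions chosen for illustration; they are not part of any package API.

    ```r
    # Great-circle (Haversine) distance in kilometres between points given in
    # decimal degrees. The radius of 6371 km is an assumed mean Earth radius.
    haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
      to_rad <- pi / 180
      dlat <- (lat2 - lat1) * to_rad
      dlon <- (lon2 - lon1) * to_rad
      a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
      # pmin() guards against rounding pushing sqrt(a) slightly above 1
      2 * radius_km * asin(pmin(1, sqrt(a)))
    }

    # One degree of longitude at the equator versus at about 49.3 degrees north
    # (roughly Vancouver's latitude; the coordinates are illustrative).
    deg_equator   <- haversine_km(0, 0, 1, 0)
    deg_vancouver <- haversine_km(-123, 49.3, -122, 49.3)
    print(c(equator = deg_equator, vancouver = deg_vancouver))
    ```

    The same eps of 1 degree therefore covers a noticeably shorter east-west distance at higher latitudes, which is why a Euclidean eps on raw degrees has no single physical unit.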

    Even with Haversine, you need to pay attention to units: eps must be given in whatever unit the distance function returns. One implementation of Haversine may yield radians, another one meters, and of course someone crazy will work in miles.
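
    As a small sketch of what the unit mismatch means in practice, the same neighborhood radius can be written in several units. The value of 50 km and the mean Earth radius of 6371 km are assumptions for illustration only.

    ```r
    # Expressing one hypothetical eps in the units different Haversine
    # implementations might return.
    earth_radius_km <- 6371        # assumed mean Earth radius
    eps_km <- 50                   # hypothetical neighborhood radius

    eps_m     <- eps_km * 1000              # for implementations returning meters
    eps_rad   <- eps_km / earth_radius_km   # for implementations returning radians
    eps_miles <- eps_km / 1.609344          # for implementations returning miles

    print(c(km = eps_km, m = eps_m, rad = eps_rad, miles = eps_miles))
    ```

    Passing eps = 50 to a function that measures in radians or meters would silently produce very different clusters, so it is worth checking the documentation of whichever distance function you use.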

    Unfortunately, as far as I can tell, none of the R implementations can use an index to accelerate Haversine distance. So it may be much faster to cluster the data in ELKI instead (you need to add an index yourself, though).

    If your data is small enough, however, you can use a precomputed distance matrix (a dist object) in R. But that takes O(n²) time and memory, so it does not scale well.
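
    A minimal sketch of the precomputed-matrix approach, using the data frame from the question. The helper haversine_km, the Earth radius of 6371 km, and eps = 120 (kilometres) are assumptions for illustration; the call to dbscan::dbscan is skipped if the package is not installed.

    ```r
    df <- data.frame(lon = c(seq(1, 5, 1), seq(1, 5, 1)),
                     lat = c(1.1, 3.1, 1.2, 4.1, 2.1, 2.2, 3.2, 2.4, 1.4, 5.1))

    # Haversine distance in kilometres (hypothetical helper, assumed radius).
    haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
      to_rad <- pi / 180
      dlat <- (lat2 - lat1) * to_rad
      dlon <- (lon2 - lon1) * to_rad
      a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
      2 * radius_km * asin(pmin(1, sqrt(a)))
    }

    # Full pairwise matrix: O(n^2) time and memory, so only for small data.
    n <- nrow(df)
    d <- outer(seq_len(n), seq_len(n), function(i, j) {
      haversine_km(df$lon[i], df$lat[i], df$lon[j], df$lat[j])
    })
    hav_dist <- as.dist(d)

    # eps is now in kilometres because the distance matrix is in kilometres.
    if (requireNamespace("dbscan", quietly = TRUE)) {
      db <- dbscan::dbscan(hav_dist, eps = 120, minPts = 2)
      print(db$cluster)
    }
    ```

    Because the matrix is built explicitly, the unit of eps is whatever unit the distance function returns, here kilometres.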