I am trying to use the dbscan package in R to cluster some spatial data. The dbscan::dbscan function takes eps and minPts as input. I have a data frame with two columns, longitude and latitude, expressed in decimal degrees, as in the following:
df <- data.frame(lon = c(seq(1,5,1), seq(1,5,1)),
lat = c(1.1,3.1,1.2,4.1,2.1,2.2,3.2,2.4,1.4,5.1))
and I apply the algorithm:
db <- fpc::dbscan(df, eps = 1, MinPts = 2)
Will eps here be defined in degrees or in some other unit? I'm trying to understand in which unit this maximum distance eps is expressed, so any help is appreciated.
Never use the fpc package; always use dbscan::dbscan instead.
If you have latitude and longitude, you need to choose an appropriate distance function such as Haversine.
The default distance function, Euclidean, ignores the spherical nature of the Earth. The eps value is then a mixture of degrees of latitude and longitude, but these do not correspond to uniform distances! One degree east at the equator covers a much greater distance than one degree east in Vancouver.
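To make the non-uniformity concrete, here is a minimal base R sketch of the Haversine formula. The function name haversine_km, the mean Earth radius of 6371 km, and the coordinates used for Vancouver are my own assumptions for illustration, not part of any package.

```r
# Great-circle (Haversine) distance in kilometres between points given in
# decimal degrees. The radius of 6371 km is an assumed mean Earth radius.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# One degree of longitude at the equator vs. at roughly Vancouver's latitude
deg_equator   <- haversine_km(0, 0, 1, 0)
deg_vancouver <- haversine_km(-123, 49.25, -122, 49.25)
```

With these assumptions, one degree of longitude comes out at about 111 km at the equator and about 73 km near Vancouver, so an eps of 1 "degree" means quite different things depending on where the points are.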
Even then, you need to pay attention to units. One implementation of Haversine may yield radians, another one meters, and of course someone crazy will work in miles.
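As a rough sketch of what that means for eps: the same neighbourhood radius has to be written differently depending on what the distance function returns. The 50 km radius and the mean Earth radius of 6371 km below are assumed values for illustration only.

```r
earth_radius_km <- 6371   # assumed mean Earth radius
eps_km <- 50              # hypothetical neighbourhood radius

eps_rad <- eps_km / earth_radius_km  # for a Haversine that returns radians
eps_m   <- eps_km * 1000             # for one that returns metres
eps_mi  <- eps_km / 1.609344         # for one that returns statute miles
```

Check the documentation of whichever distance function you use to see which of these applies.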
Unfortunately, as far as I can tell, none of the R implementations can accelerate Haversine distance. So it may be much faster to cluster the data in ELKI instead (you need to add an index yourself though).
If your data is small enough, you can however use a precomputed distance matrix (a dist object) in R. But that will take O(n²) time and memory, so it is not very scalable.
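A minimal sketch of the precomputed-matrix approach, using the data frame from the question. The helper haversine_matrix_km, the 6371 km radius, and the eps of 120 km are assumptions chosen for illustration; the dbscan call is guarded so that the distance computation still runs if the package is not installed.

```r
# Example data from the question (decimal degrees)
df <- data.frame(lon = c(seq(1, 5, 1), seq(1, 5, 1)),
                 lat = c(1.1, 3.1, 1.2, 4.1, 2.1, 2.2, 3.2, 2.4, 1.4, 5.1))

# All pairwise Haversine distances in kilometres
# (hypothetical helper; assumed mean Earth radius of 6371 km)
haversine_matrix_km <- function(lon, lat, radius_km = 6371) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  dlat <- outer(lat, lat, "-")
  dlon <- outer(lon, lon, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# O(n^2) time and memory: fine for small data only
d <- as.dist(haversine_matrix_km(df$lon, df$lat))

# eps is now in kilometres, the unit of the distance matrix;
# 120 km is an arbitrary illustrative threshold
if (requireNamespace("dbscan", quietly = TRUE)) {
  db <- dbscan::dbscan(d, eps = 120, minPts = 2)
  print(db$cluster)
}
```

Because the distances are computed up front, eps is unambiguously in whatever unit the matrix holds, here kilometres.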