I've got a dataframe that looks like this:
long lat site
-141.37 61.13 x1
-149.1833 66.7333 x2
-149.667 67.667 x3
-141.3667 61.1157 x4
I want to calculate the distances between all of the site
's using distVincentyEllipsoid
. Then for those sites that are located within 5km distance from each other, I want to modify the site
name to include both sites. So, in this example x1
and x4
are within 5km from each other, so it will be like this:
long lat site
-141.37 61.13 x1_x4
-149.1833 66.7333 x2
-149.667 67.667 x3
-141.3667 61.1157 x1_x4
I know I can calculate a matrix between all site
's in this way:
df %>% dplyr::select('long', 'lat')
distm(df, fun = distVincentyEllipsoid)
But I don't know how to take it from there.
It is helpful if you provide the example data as R code, like this
x <- matrix(c(-141.37, 61.13, -149.1833, 66.7333, -149.667, 67.667, -141.3667, 61.1157), ncol=2, byrow=TRUE)
colnames(x) <- c("lon", "lat")
x <- data.frame(site=paste0("x", 1:4), x)
but thank you for showing the expected output
Solution:
As you suggested, first make a distance matrix. Then classify that as within the threshold distance or not, and then use the rows to select the records. Note that I use distGeo
--- it is a better method than distVincentyEllipsoid
.
library(geosphere)
m <- distm(x[, c("lon", "lat")], fun=distGeo)
m <- m < 5000
x$name <- apply(m, 1, function(i) paste(x$site[i], collapse="_"))
x
# site lon lat name
#1 x1 -141.3700 61.1300 x1_x4
#2 x2 -149.1833 66.7333 x2
#3 x3 -149.6670 67.6670 x3
#4 x4 -141.3667 61.1157 x1_x4
If you have many points the distance matrix may become too large. In that case you could do
y <- x[, c("lon", "lat")]
for (i in 1:nrow(y)) {
j <- distGeo(y[i, ], y) < 5000
x$name[i] <- paste(x$site[j], collapse="_")
}
or like this
y <- x[, c("lon", "lat")]
x$name <- x$site
for (i in 1:nrow(y)) {
j <- distGeo(y[i, ], y) < 5000
if (any(j)) {
x$name[i] <- paste(x$site[j], collapse="_")
}
}