Search code examples
rgeosphere

Alter a column based on results from geodistance matrix


I've got a dataframe that looks like this:

long       lat       site
-141.37    61.13     x1
-149.1833  66.7333   x2
-149.667   67.667    x3
-141.3667  61.1157   x4

I want to calculate the distances between all of the site's using distVincentyEllipsoid. Then for those sites that are located within 5km distance from each other, I want to modify the site name to include both sites. So, in this example x1 and x4 are within 5km from each other, so it will be like this:

 long      lat       site  
-141.37    61.13     x1_x4    
-149.1833  66.7333   x2
-149.667   67.667    x3
-141.3667  61.1157   x1_x4

I know I can calculate a matrix between all site's in this way:

df %>% dplyr::select('long', 'lat')
distm(df, fun = distVincentyEllipsoid)

But I don't know how to take it from there.


Solution

  • It is helpful if you provide the example data as R code, like this

    x <- matrix(c(-141.37, 61.13, -149.1833, 66.7333, -149.667, 67.667, -141.3667, 61.1157), ncol=2, byrow=TRUE)
    colnames(x) <- c("lon", "lat")
    x <- data.frame(site=paste0("x", 1:4), x)
    

    but thank you for showing the expected output

    Solution:

    As you suggested, first make a distance matrix. Then classify that as within the threshold distance or not, and then use the rows to select the records. Note that I use distGeo --- it is a better method than distVincentyEllipsoid.

    library(geosphere)
    m <- distm(x[, c("lon", "lat")], fun=distGeo)
    
    m <- m < 5000
    x$name <- apply(m, 1, function(i) paste(x$site[i], collapse="_"))
    x
    #  site       lon     lat    name
    #1   x1 -141.3700 61.1300   x1_x4
    #2   x2 -149.1833 66.7333      x2
    #3   x3 -149.6670 67.6670      x3
    #4   x4 -141.3667 61.1157   x1_x4
    

    If you have many points the distance matrix may become too large. In that case you could do

    y <- x[,  c("lon", "lat")]
    for (i in 1:nrow(y)) {
       j <- distGeo(y[i, ], y) < 5000
       x$name[i] <- paste(x$site[j], collapse="_")
    } 
    

    or like this

    y <- x[,  c("lon", "lat")]
    x$name <- x$site    
    for (i in 1:nrow(y)) {
       j <- distGeo(y[i, ], y) < 5000
       if (any(j)) {
           x$name[i] <- paste(x$site[j], collapse="_")
       }
    }