
Working with spatial data: How to find the nearest neighbour of points without replacement?


I am currently working with some forest inventory data. The data were collected on sample plots whose positions are available as point data (spatial data).

I have two datasets:

  • dataset dat.1 with n sample plots of species A
  • dataset dat.2 with k sample plots of species B

with n < k

What I want to do is to match every point of dat.1 with a point of dat.2. The result should be n pairs of points. So n of k plots from dat.2 should be selected.

The criteria for matching are:

  • spatial distance between a pair of points is as close as possible
  • one point of dat.2 can only be matched with one point in dat.1 and vice versa. So if there is a pair of points, these points should not be used in any other pair, even if it would be useful in terms of shortest distance. The "occupied" points should not be replaced and should not be used in the further matching process.

I have been looking for a long time for ways to perform this analysis. There are functions like st_nn from 'nngeo' or nn2 from 'RANN' which return the k nearest neighbours of a point. However, these functions cannot rule out replacement: the same point of dat.2 can be returned as the neighbour of several points of dat.1.

In the package 'MatchIt' there are possibilities to perform nearest neighbour matching without replacement. Yet these functions are designed to minimise the distance between control variables, not between spatial locations.

Does anyone have an idea how to meet these requirements? I would really appreciate any hints or suggestions for packages and / or functions that could help me with this issue.


Solution

  • The first thing you should do is create your own distance matrix. The rows should correspond to those in dat.1 and the columns to those in dat.2, and each entry in the matrix is the distance between the plot in the row and the plot in the column. You can do this manually by looping through your datasets and computing the Euclidean (or other) distance between the points. You can also use the match_on function in the optmatch package to do this with the following code:

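    # Stack both datasets and flag their origin: 1 = dat.1, 0 = dat.2
    d <- rbind(dat.1, dat.2)
    d$dat <- c(rep(1, nrow(dat.1)), rep(0, nrow(dat.2)))
    # x.coor and y.coord stand for the coordinate columns in your data
    dist <- optmatch::match_on(dat ~ x.coor + y.coord, data = d,
                               method = "euclidean")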
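
If you would rather build the distance matrix by hand, as mentioned above, here is a minimal base R sketch. The toy coordinates and the column names x and y are made up for illustration, and it assumes projected coordinates (for example metres), since plain Euclidean distance is not meaningful on longitude/latitude. It uses outer() instead of an explicit loop.

```r
# Toy coordinates standing in for the real plots (hypothetical values and names)
dat.1 <- data.frame(x = c(0, 10), y = c(0, 0))
dat.2 <- data.frame(x = c(1, 9, 20), y = c(0, 0, 0))

# Difference of every dat.1 coordinate against every dat.2 coordinate
dx <- outer(dat.1$x, dat.2$x, "-")
dy <- outer(dat.1$y, dat.2$y, "-")

# Euclidean distance: rows index dat.1 plots, columns index dat.2 plots
dist.mat <- sqrt(dx^2 + dy^2)

# Label rows and columns so each plot can be identified later
dimnames(dist.mat) <- list(paste0("A", seq_len(nrow(dat.1))),
                           paste0("B", seq_len(nrow(dat.2))))
dist.mat
```

The result has one row per dat.1 plot and one column per dat.2 plot, which is the layout described above.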
    

    Once you have a distance matrix in this form, you can supply it to pairmatch in the optmatch package. pairmatch performs K:1 optimal matching without replacement. The matching is optimal in that the sum of the distances between matched pairs is as low as possible. It doesn't guarantee that every unit gets its nearest neighbor, but it avoids pairing units that are very far apart. The controls argument sets how many dat.2 units are matched to each dat.1 unit; for example, controls = 2 would match 2 plots from dat.2 to each plot in dat.1. The default, controls = 1, gives the one-to-one matching you are asking for:

    # data = d keeps the result in the same row order as d
    d$pairs <- optmatch::pairmatch(dist, data = d)
    

    The output is a factor containing pair membership for each unit. Unmatched units will have a value of NA.
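
To see why an optimal match can differ from "everyone takes their nearest free neighbour", here is a small base R sketch on a made-up 2 x 3 distance matrix. It compares a greedy pass with a brute-force search over all possible assignments. The brute force is only there for illustration on tiny inputs; it is not how optmatch solves the problem, and the matrix values are hypothetical.

```r
# Hypothetical 2 x 3 distance matrix (rows: dat.1 plots, columns: dat.2 plots)
dist.mat <- matrix(c(1.0,  2, 10,
                     1.5, 10, 10),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("A1", "A2"), c("B1", "B2", "B3")))
rows <- seq_len(nrow(dist.mat))

# Greedy matching: each row in turn takes its nearest still-unused column
greedy <- integer(nrow(dist.mat))
used <- logical(ncol(dist.mat))
for (i in rows) {
  d.i <- dist.mat[i, ]
  d.i[used] <- Inf            # occupied plots are no longer available
  greedy[i] <- which.min(d.i)
  used[greedy[i]] <- TRUE
}
greedy.total <- sum(dist.mat[cbind(rows, greedy)])

# Brute force: list every choice of distinct columns, keep the cheapest
candidates <- expand.grid(rep(list(seq_len(ncol(dist.mat))), nrow(dist.mat)))
candidates <- candidates[apply(candidates, 1, function(j) !anyDuplicated(j)), ]
totals <- apply(candidates, 1, function(j) sum(dist.mat[cbind(rows, j)]))
optimal <- unlist(candidates[which.min(totals), ], use.names = FALSE)
optimal.total <- min(totals)
```

Here the greedy pass pairs A1 with B1 and A2 with B2 for a total distance of 11, while the cheapest assignment pairs A1 with B2 and A2 with B1 for a total of 3.5. A1 does not get its nearest neighbour in the optimal solution, but no pair ends up far apart. Note that the greedy result also depends on the order in which the rows are processed.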

    You can also do this in one single step with

    d$pairs <- optmatch::pairmatch(dat ~ x.coor + y.coord, data = d,
                                   method = "euclidean")
    

    Then you can subset your dataset so only matched plots remain:

    matched <- d[!is.na(d$pairs),]
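
If you then want one row per pair, a possible way is to merge the two halves of the matched data on the pair label. This is only a sketch: the data frame below is a hand-made stand-in for the matched data, and the plot, x and y columns and the pair labels are made up for illustration.

```r
# Hand-made stand-in for the matched data: 'dat' flags the source dataset
# (1 = dat.1, 0 = dat.2) and 'pairs' mimics the factor returned by pairmatch
matched <- data.frame(
  plot  = c("A1", "A2", "B1", "B2"),
  x     = c(0, 10, 1, 9),
  y     = c(0, 0, 0, 0),
  dat   = c(1, 1, 0, 0),
  pairs = factor(c("1.1", "1.2", "1.1", "1.2"))
)

# One row per pair: dat.1 columns get suffix .1, dat.2 columns get suffix .2
keep <- c("pairs", "plot", "x", "y")
pair.table <- merge(matched[matched$dat == 1, keep],
                    matched[matched$dat == 0, keep],
                    by = "pairs", suffixes = c(".1", ".2"))

# Distance between the two plots of each pair
pair.table$distance <- sqrt((pair.table$x.1 - pair.table$x.2)^2 +
                            (pair.table$y.1 - pair.table$y.2)^2)
pair.table
```

Each row of pair.table then holds one dat.1 plot, its matched dat.2 plot and the distance between them, which makes it easy to check for pairs that are further apart than you are willing to accept.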