r, merge, spatial, temporal

Fuzzy merging two data sets by Lat, Lon and time


I have two data sets: the fire data set is huge, and the global temperature data set is considerably smaller.

I would like to match the two data sets by DISCOVERY_DATE = date, LATITUDE = Latitude and LONGITUDE = Longitude. I know most rows will not match exactly, so I am looking for the closest match possible. I think fuzzyjoin would be a good way to go about this, but how would one match on all three columns with it?

I think the issue may be that I can't seem to find a good function for this.

 tempFire <- fuzzy_join(fires, Temps, multi_by = c("DISCOVERY_DATE" = "date", "LONGITUDE" = "Longitude", "LATITUDE" = "Latitude"), multi_match_fun = D, mode = "full")

Data

> head(z, n =10)
   fires.LATITUDE fires.LONGITUDE fires.DISCOVERY_DATE
1        40.03694       -121.0058           1970-01-29
2        38.93306       -120.4044           1970-01-29
3        38.98417       -120.7356           1970-01-29
4        38.55917       -119.9133           1970-01-29
5        38.55917       -119.9331           1970-01-29
6        38.63528       -120.1036           1970-01-29
7        38.68833       -120.1533           1970-01-29
8        40.96806       -122.4339           1970-01-29
9        41.23361       -122.2833           1970-01-29
10       38.54833       -120.1492           1970-01-29
> head(b, n = 10)
   Temps.Latitude Temps.Longitude Temps.date
1           32.95         -100.53 1992-01-01
2           32.95         -100.53 1992-02-01
3           32.95         -100.53 1992-03-01
4           32.95         -100.53 1992-04-01
5           32.95         -100.53 1992-05-01
6           32.95         -100.53 1992-06-01
7           32.95         -100.53 1992-07-01
8           32.95         -100.53 1992-08-01
9           32.95         -100.53 1992-09-01
10          32.95         -100.53 1992-10-01

Solution

  • I would recommend that you come up with an appropriate distance metric based on a weighted combination of temporal distance (i.e. subtracting the dates) and spatial distance (based on lat & long). Determine the weights based on the relative importance of spatial and temporal proximity for your application. Then compute a matrix containing the distance from every point in the first data set to every point in the second data set using this distance metric. Finally, find the minimum distance in each row and/or column to select data points in one dataset that are closest to the points in the other data set. You will probably want to discard any pairs with a distance greater than some threshold.
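The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not a definitive implementation: the toy rows, the helper `haversine_km`, the weights `w_space` and `w_time`, and the cutoff `max_dist` are all hypothetical and would need to be chosen for the real application. The combined metric here is simply a weighted sum of great-circle distance in kilometres and absolute date difference in days.

```r
# Toy stand-ins for the two data sets (values are made up for illustration).
fires <- data.frame(
  LATITUDE       = c(40.03694, 38.93306, 33.10000),
  LONGITUDE      = c(-121.0058, -120.4044, -100.6000),
  DISCOVERY_DATE = as.Date(c("1992-01-10", "1992-02-20", "1992-03-03"))
)
Temps <- data.frame(
  Latitude  = c(32.95, 32.95, 39.50),
  Longitude = c(-100.53, -100.53, -120.80),
  date      = as.Date(c("1992-01-01", "1992-03-01", "1992-02-01"))
)

# Hypothetical helper: great-circle (haversine) distance in km.
# All coordinates are in decimal degrees; vectorised over its arguments.
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Assumed weights and threshold: 1 day apart "costs" as much as 2 km apart.
w_space  <- 1
w_time   <- 2
max_dist <- 500

# Pairwise distance matrices: one row per fire, one column per Temps row.
space_km <- outer(
  seq_len(nrow(fires)), seq_len(nrow(Temps)),
  function(i, j) haversine_km(fires$LATITUDE[i], fires$LONGITUDE[i],
                              Temps$Latitude[j], Temps$Longitude[j])
)
time_days <- abs(outer(as.numeric(fires$DISCOVERY_DATE),
                       as.numeric(Temps$date), "-"))
combined <- w_space * space_km + w_time * time_days

# For each fire, pick the Temps row with the smallest combined distance.
best      <- apply(combined, 1, which.min)
best_dist <- combined[cbind(seq_len(nrow(fires)), best)]

# Attach the matched Temps rows, then drop pairs beyond the threshold.
matched <- Temps[best, ]
rownames(matched) <- NULL
tempFire <- cbind(fires, matched, match_dist = best_dist)
tempFire <- tempFire[tempFire$match_dist <= max_dist, ]
print(tempFire)
```

Note that the full matrix has `nrow(fires) * nrow(Temps)` entries, so for a very large fire data set it may not fit in memory. In that case the same calculation can be run on chunks of `fires` rows and the results bound together.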