Search code examples
rjoindplyrspatialterra

Joining two dataframes with location data (lon, lat) and same resolution


I have two dataframes (df_1 and df_2) containing one variable each (var1 and var2). They are both global maps, in the same crs and at the same spatial resolution (0.083 on the global grid): Var1 Var2. If it can help, I created both dataframes starting from NetCDF files at different spatial resolutions and extents, using package terra.

I need the two variables to be in the same dataframe to perform analyses on them. However, I've been trying to merge the dataframes based on spatial coordinates (columns lon and lat) but it returns a column of NAs.

  1. I tried first with join from dplyr:
> df <- df_1 %>%
+   left_join(df_2, by = c("lon", "lat"))
> summary(df)
      lon               lat               var1             var2        
 Min.   :-179.96   Min.   :-55.458   Min.   : 0.000   Min.   : NA      
 1st Qu.: -63.04   1st Qu.:  8.208   1st Qu.: 9.398   1st Qu.: NA      
 Median :  31.21   Median : 37.791   Median :15.052   Median : NA      
 Mean   :  20.81   Mean   : 30.864   Mean   :16.223   Mean   :NaN      
 3rd Qu.:  93.12   3rd Qu.: 56.791   3rd Qu.:21.898   3rd Qu.: NA      
 Max.   : 179.96   Max.   : 83.041   Max.   :62.238   Max.   : NA      
                                                      NA's   :2075418  
  1. I converted them to sf and then used st_join:
> sf_1 <- st_as_sf(df_1,
+                    coords = c("lon", "lat"),
+                    crs = 4326)
> sf_2 <- st_as_sf(df_2,
+                    coords = c("lon", "lat"),
+                    crs = 4326)
> df_spatialjoin <- st_join(sf_1, sf_2)
> summary(df_spatialjoin)
      var1             var2                  geometry      
 Min.   : 0.000   Min.   : NA       POINT        :2075418  
 1st Qu.: 9.398   1st Qu.: NA       epsg:4326    :      0  
 Median :15.052   Median : NA       +proj=long...:      0  
 Mean   :16.223   Mean   :NaN                              
 3rd Qu.:21.898   3rd Qu.: NA                              
 Max.   :62.238   Max.   : NA                              
                  NA's   :2075418                      
  1. I tried to convert the sf back to tibble and then use dplyr:
> df_join <- left_join(as_tibble(sf_1),
+                      as_tibble(sf_2),
+                      by = "geometry")
> summary(df_join)
      var1                 geometry            var2        
 Min.   : 0.000   POINT        :2075418   Min.   : NA      
 1st Qu.: 9.398   epsg:4326    :      0   1st Qu.: NA      
 Median :15.052   +proj=long...:      0   Median : NA      
 Mean   :16.223                           Mean   :NaN      
 3rd Qu.:21.898                           3rd Qu.: NA      
 Max.   :62.238                           Max.   : NA      
                                          NA's   :2075418  

I also tried geo_join from fuzzyjoin as explained in this thread but it also didn't work.

I had this problem in the past, but I would solve it using cbind() instead of any join() function. Unfortunately, here cbind() won't work as the two dataframes originate from different NetCDF files and have thus different row numbers.


Solution

  • This worked for me. I think there was a problem with the original maps with which the dataframes were created. So I did:

    library(terra)
    library(tidyterra)
    library(tidyverse)
    
    # transform dataframes back to maps
    map_1 <- as_spatraster(df_1, xycols = 1:2, crs = 4326, digits = 6)
    map_2 <- as_spatraster(df_2, xycols = 1:2, crs = 4326, digits = 6)
    plot(map_int)
    
    # resample one dataframe to match the other
    map_1 <- terra::resample(map_1, map_2)
    
    # convert both maps back to dataframes
    df_1 <- terra::as.data.frame(map_1, xy = TRUE)
    df_2 <- terra::as.data.frame(map_2, xy = TRUE)
    
    # merge with dplyr
    df <- df_1 %>%
      inner_join(df_2,
                by = c("x", "y"))
    

    I hope this could help other people that had the same problem.