Search code examples
rdata-manipulation

Join many to one: combine related characteristics


I have a dataframe where each row represents a spatial unit. The nbid* variables indicate which unit is a neighbour. I would like to get the dum variable of the neighbour into the main dataframe. (Instead of spatial units it could be any kind of relations within a dataframe - business partners, relatives, related genes etc.) Some simplified data look like this:

seed(999)
df_base <- data.frame(id = seq(1:100),
                 dum= sample(c(rep(0,50), rep(1,50)),100),
                 nbid_1=sample(1:100,100),
                 nbid_2=sample(1:100,100),
                 nbid_3=sample(1:100,100)) %>% 
  mutate(nbid_1 =  replace(nbid_1, sample(row_number(), size = ceiling(0.1 * n()), replace = FALSE), NA),
         nbid_2 =  replace(nbid_2, sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE), NA),
         nbid_3 =  replace(nbid_3, sample(row_number(), size = ceiling(0.7 * n()), replace = FALSE), NA))

(In these simplified data and other than in the real data, neighbours 1,2 and 3 can be the same, but that does not matter for the question.)

My approach was to duplicate and then join the data, which would look like this:

df1 <- df_base
df2 <- df_base %>% 
  select(-c(nbid_1,nbid_2,nbid_3)) %>% 
  rename(nbdum=dum)

df <- left_join(df1,df2,by=c("nbid_1"="id")) %>% 
  rename(nbdum1=nbdum) %>% 
  left_join(.,df2,by=c("nbid_2"="id")) %>% 
  rename(nbdum2=nbdum) %>% 
  left_join(.,df2,by=c("nbid_3"="id")) %>% 
  rename(nbdum3=nbdum)

df is the result that I am looking for - from here I can create an overall neighbour dummy or a count. This approach is however neither elegant nor feasible to implement with the real data which has many more neighbours.

How can I solve this in a less clumsy way?

Thanks in advance for your ideas!!


Solution

  • A key clue is that when you see var_1, var_2, ..., var_n, it suggests that the data can be transformed to be longer. See pivot_longer() or data.table::melt() where molten data is discussed frequently.

    For your example, we can pivot and then join the df2 table back. I am unsure if the format is needed but after the join, we can pivot back to wide with pivot_wider().

    library(dplyr)
    library(tidyr)
    
    
    df1 %>%
      select(!id) %>%
      pivot_longer(cols = starts_with("nbid"), names_prefix = "nbid_")%>%
      mutate(original_id = rep(1:100, each = 3))%>%
      left_join(df2, by = c("value" = "id"))%>%
      pivot_wider(original_id, values_from = c(value, nbdum))
    
    #> # A tibble: 100 × 7
    #>    original_id value_1 value_2 value_3 nbdum_1 nbdum_2 nbdum_3
    #>          <int>   <int>   <int>   <int>   <dbl>   <dbl>   <dbl>
    #>  1           1      25      90      23       0       0       1
    #>  2           2      12      NA      NA       1      NA      NA
    #>  3           3      11      40      47       0       0       0
    #>  4           4      94      87      NA       0       1      NA
    #>  5           5      46      77      NA       1       0      NA
    #>  6           6      98      82      NA       1       0      NA
    #>  7           7      43      NA      NA       1      NA      NA
    #>  8           8      74      NA       7       0      NA       1
    #>  9           9      57      NA      NA       1      NA      NA
    #> 10          10      49      72      NA       0       0      NA
    #> # … with 90 more rows
    
    ## compare to original
    
    as_tibble(df)
    #> # A tibble: 100 × 8
    #>       id   dum nbid_1 nbid_2 nbid_3 nbdum1 nbdum2 nbdum3
    #>    <int> <dbl>  <int>  <int>  <int>  <dbl>  <dbl>  <dbl>
    #>  1     1     0     25     90     23      0      0      1
    #>  2     2     1     12     NA     NA      1     NA     NA
    #>  3     3     1     11     40     47      0      0      0
    #>  4     4     1     94     87     NA      0      1     NA
    #>  5     5     0     46     77     NA      1      0     NA
    #>  6     6     1     98     82     NA      1      0     NA
    #>  7     7     1     43     NA     NA      1     NA     NA
    #>  8     8     0     74     NA      7      0     NA      1
    #>  9     9     0     57     NA     NA      1     NA     NA
    #> 10    10     0     49     72     NA      0      0     NA
    #> # … with 90 more rows