Search code examples
rdplyrrowwiseacross

Look across vectorised pairs to return matching values


Related to a question I asked here: Find if a date overlaps between multiple pairs of vectorised dates

Example initial data:

  person start_loc start_date end_date.1 end_date.2 end_date.3 end_loc.1 end_loc.2 end_loc.3
1      1         a 2021-02-10 2021-02-17       <NA>       <NA>         g                    
2      2         a 2021-01-30 2020-09-29 2020-12-12 2021-02-04         a         a         g
3      3         g 2020-12-04       <NA>       <NA>       <NA>                              
4      4         r 2020-12-09 2020-12-12 2020-12-14 2021-01-05         c         c         g
5      5         t 2021-03-22 2021-03-25 2021-03-29       <NA>         b         t          
6      6         b 2021-04-04 2021-04-07 2021-04-09       <NA>         b         t          

example <- structure(list(person = 1:6, start_loc = c("a", "a", "g", "r", 
"t", "b"), start_date = structure(c(18668, 18657, 18600, 18605, 
18708, 18721), class = "Date"), end_date.1 = structure(c(18675, 
18534, NA, 18608, 18711, 18724), class = "Date"), end_date.2 = structure(c(NA, 
18608, NA, 18610, 18715, 18726), class = "Date"), end_date.3 = structure(c(NA, 
18662, NA, 18632, NA, NA), class = "Date"), end_loc.1 = c("g", 
"a", "", "c", "b", "b"), end_loc.2 = c("", "a", "", "c", "t", 
"t"), end_loc.3 = c("", "g", "", "g", "", "")), class = "data.frame", row.names = c(NA, 
-6L))

My data is arranged so that I have rows for each person and a start_date, together with a start_loc. I want to find out which people had an

  • end_date within 7 days of the start_date,
  • and where there are two or more pairs which meet this criteria, prioritise those whose end_loc matches their start_loc
  • else take the earliest.

So the desired output would look something like:

 person start_loc start_date end_date.1 end_date.2 end_date.3 end_loc.1 end_loc.2 end_loc.3   end_date end_loc
1      1         a 2021-02-10 2021-02-17       <NA>       <NA>         g                     2021-02-17       g
2      2         a 2021-01-30 2020-09-29 2020-12-12 2021-02-04         a         a         g       <NA>        
3      3         g 2020-12-04       <NA>       <NA>       <NA>                                     <NA>        
4      4         r 2020-12-09 2020-12-12 2020-12-14 2021-01-05         c         c         g 2020-12-12       c
5      5         t 2021-03-22 2021-03-25 2021-03-29       <NA>         b         t           2021-03-29       t
6      6         b 2021-04-04 2021-04-07 2021-04-09       <NA>         b         t           2021-04-07       b

There are some techniques I followed from the last question like using c_across,across and rowwise but I can't seem to get R to bring back a single output. Is this possible? Do I need to structure the data long-wise again?


Solution

  • Sorry for delayed response, but you could do something like this

    • Person should have a result (perhaps a typo in your dput)
    • replaced empty strings '' with NA in the data
    example <- structure(list(person = 1:6, start_loc = c("a", "a", "g", "r", 
                                                          "t", "b"), start_date = structure(c(18668, 18657, 18600, 18605, 
                                                                                              18708, 18721), class = "Date"), end_date.1 = structure(c(18675, 
                                                                                                                                                       18534, NA, 18608, 18711, 18724), class = "Date"), end_date.2 = structure(c(NA, 
                                                                                                                                                                                                                                  18608, NA, 18610, 18715, 18726), class = "Date"), end_date.3 = structure(c(NA, 
                                                                                                                                                                                                                                                                                                             18662, NA, 18632, NA, NA), class = "Date"), end_loc.1 = c("g", 
                                                                                                                                                                                                                                                                                                                                                                       "a", NA, "c", "b", "b"), end_loc.2 = c(NA, "a", NA, "c", "t", 
                                                                                                                                                                                                                                                                                                                                                                                                              "t"), end_loc.3 = c(NA, "g", NA, "g", NA, NA)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  -6L))
    
    library(tidyverse)
    example %>% left_join(example %>% pivot_longer(cols = !c(person, start_loc, start_date), names_sep = '\\.', 
                                                   names_to = c('.value', 'number'),
                                                   values_drop_na = T) %>%
                            group_by(person) %>%
                            mutate(diff = end_date - start_date,
                                   cond2 = diff <= 7 & diff >= 0,
                                   cond1 = start_loc == end_loc) %>%
                            filter(cond2) %>%
                            arrange(person, -cond1, diff) %>%
                            summarise(end_date = first(end_date),
                                      end_loc = first(end_loc)), by = 'person')
    #>   person start_loc start_date end_date.1 end_date.2 end_date.3 end_loc.1
    #> 1      1         a 2021-02-10 2021-02-17       <NA>       <NA>         g
    #> 2      2         a 2021-01-30 2020-09-29 2020-12-12 2021-02-04         a
    #> 3      3         g 2020-12-04       <NA>       <NA>       <NA>      <NA>
    #> 4      4         r 2020-12-09 2020-12-12 2020-12-14 2021-01-05         c
    #> 5      5         t 2021-03-22 2021-03-25 2021-03-29       <NA>         b
    #> 6      6         b 2021-04-04 2021-04-07 2021-04-09       <NA>         b
    #>   end_loc.2 end_loc.3   end_date end_loc
    #> 1      <NA>      <NA> 2021-02-17       g
    #> 2         a         g 2021-02-04       g
    #> 3      <NA>      <NA>       <NA>    <NA>
    #> 4         c         g 2020-12-12       c
    #> 5         t      <NA> 2021-03-29       t
    #> 6         t      <NA> 2021-04-07       b
    

    Actually, the syntax inside left_join does the job of summarising

    example %>% pivot_longer(cols = !c(person, start_loc, start_date), names_sep = '\\.', 
                                                   names_to = c('.value', 'number'),
                                                   values_drop_na = T) %>%
                            group_by(person) %>%
                            mutate(diff = end_date - start_date,
                                   cond2 = diff <= 7 & diff >= 0,
                                   cond1 = start_loc == end_loc) %>%
                            filter(cond2) %>%
                            arrange(person, -cond1, diff) %>%
                            summarise(end_date = first(end_date),
                                      end_loc = first(end_loc))
    
    # A tibble: 5 x 3
      person end_date   end_loc
       <int> <date>     <chr>  
    1      1 2021-02-17 g      
    2      2 2021-02-04 g      
    3      4 2020-12-12 c      
    4      5 2021-03-29 t      
    5      6 2021-04-07 b