Search code examples
rdplyrtidyrlubridate

Create columns based on date filter within group


I have the following data:

> dput(my_data)
structure(list(id = c(1, 1, 1, 2, 2, 3, 3), begin = c("2017-01-01", 
"2017-08-01", "2022-05-01", "2017-01-01", "2017-09-01", "2017-01-01", 
"2017-09-01"), end = c("2017-07-01", "2022-04-01", "2023-06-01", 
"2017-08-01", "2023-06-01", "2017-08-01", "2023-06-01"), position = c("position_1", 
"position_2", "position_3", "position_1", "position_2", "position_1", 
"position_1")), row.names = c(NA, -7L), class = "data.frame")


 id      begin        end   position
1  1 2017-01-01 2017-07-01 position_1
2  1 2017-08-01 2022-04-01 position_2
3  1 2022-05-01 2023-06-01 position_3
4  2 2017-01-01 2017-08-01 position_1
5  2 2017-09-01 2023-06-01 position_2
6  3 2017-01-01 2017-08-01 position_1
7  3 2017-09-01 2023-06-01 position_1

What I want: I am trying to create a new column (e.g. first_position_in_2023) that contains the first position for each id that was present in 2023-01-01. (The first position with a timestamp begin-end that includes 2023-01-01.

So for my Data:

  • id == 1: position_3
  • id == 2: position_2
  • id == 2: position_1

Ideally I would want to be able to create multiple columns like:

  • first_position_in_2017
  • first_position_in_2018
  • first_position_in_2019 ...

What I tried:


my_data %>%
  group_by(id) %>%
  summarise(first_position_in_2023 = position[begin <= (as.Date("2023-01-01"))])


# A tibble: 7 x 2
# Groups:   id [3]
     id first_position_in_2023
  <dbl> <chr>                 
1     1 position_1            
2     1 position_2            
3     1 position_3            
4     2 position_1            
5     2 position_2            
6     3 position_1            
7     3 position_1  

This seems to give me just the result per Row and not the first result for each group.


Solution

  • Since you include all the tags, you could try this dplyr/tidyr/lubridate approach for the broader goal of doing it for each year:

    library(dplyr)
    library(tidyr)
    
    df %>%
      mutate(year = lubridate::year(end)) %>% select(-c(begin, end)) %>%
      pivot_wider(names_from = year, values_from = position, 
                  names_glue = "first_position_in_{year}")
    

    Output:

        id first_position_in_2017 first_position_in_2022 first_position_in_2023
      <dbl> <chr>                  <chr>                  <chr>                 
    1     1 position_1             position_2             position_3            
    2     2 position_1             NA                     position_2            
    3     3 position_1             NA                     position_1  
    

    Per your comment, if you wanted to create columns for years that are not in the data (ie, 2019, 2020, etc), you could use complete just before pivot_wider:

    df %>%
      mutate(year = lubridate::year(end)) %>% select(-c(begin, end)) %>%
      complete(year = min(year):max(year)) %>%
      pivot_wider(names_from = year, values_from = position, 
                  names_glue = "first_position_in_{year}") %>% filter(!is.na(id))
    

    Output

         id first_position_in_2017 first_position_in_2018 first_position_in_2019 first_position_in_2020 first_position_in_2021 first_position_in_2022 first_position_in_2023
      <dbl> <chr>                  <chr>                  <chr>                  <chr>                  <chr>                  <chr>                  <chr>                 
    1     1 position_1             NA                     NA                     NA                     NA                     position_2             position_3            
    2     2 position_1             NA                     NA                     NA                     NA                     NA                     position_2            
    3     3 position_1             NA                     NA                     NA                     NA                     NA                     position_1