Filter for specific combination of two IDs


I have a dataframe my_df which already has some values for an ID/Date combination:

set.seed(42)
my_df <- data.frame(ID = c('A', 'B', 'C', 'A', 'B'),
                    Date = seq(lubridate::date('2022-01-01'), lubridate::date('2022-01-05'), by = 1),
                    Value = rnorm(5))

> my_df
  ID       Date      Value
1  A 2022-01-01  1.3709584
2  B 2022-01-02 -0.5646982
3  C 2022-01-03  0.3631284
4  A 2022-01-04  0.6328626
5  B 2022-01-05  0.4042683

Now I have a second data frame new_df with partly the same ID/Date combinations, partly new ones:

new_df <- data.frame(ID = c('A', 'B', 'C', 'A', 'B'),
                     Date = seq(lubridate::date('2022-01-01'), lubridate::date('2022-01-05'), by = 1)) |>
    dplyr::bind_rows(data.frame(ID = c('A', 'B', 'D', 'D'),
                                Date = c(lubridate::date('2022-01-02'),
                                         lubridate::date('2022-01-01'),
                                         lubridate::date('2022-01-01'),
                                         lubridate::date('2022-01-07'))))

> new_df
  ID       Date
1  A 2022-01-01
2  B 2022-01-02
3  C 2022-01-03
4  A 2022-01-04
5  B 2022-01-05
6  A 2022-01-02
7  B 2022-01-01
8  D 2022-01-01
9  D 2022-01-07

I would like to filter new_df down to only the four additional cases, i.e. the ID/Date combinations that are not already in my_df. One way to do this is to create a dummy ID by simple concatenation, like so:

> new_df |>
+   dplyr::mutate(Dummy_ID = paste0(ID, Date)) |>
+   dplyr::filter(!(Dummy_ID %in% (my_df |> dplyr::mutate(Dummy_ID = paste0(ID, Date)) |> dplyr::pull(Dummy_ID))))
  ID       Date    Dummy_ID
1  A 2022-01-02 A2022-01-02
2  B 2022-01-01 B2022-01-01
3  D 2022-01-01 D2022-01-01
4  D 2022-01-07 D2022-01-07

Is it possible to achieve this result more elegantly without a dummy ID by only working with ID and Date?


Solution

  • anti_join is a good fit for this situation: it returns the rows of the first dataframe whose combination of the by columns (here ID and Date) has no match in the second dataframe:

    > new_df2 <- dplyr::anti_join(new_df, my_df, by = c('ID','Date'))
    > new_df2
      ID       Date
    1  A 2022-01-02
    2  B 2022-01-01
    3  D 2022-01-01
    4  D 2022-01-07
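
To check that the join really selects the same rows as the dummy-ID filter from the question, the two approaches can be compared side by side. The following is a minimal sketch, assuming dplyr is installed; it rebuilds the two data frames with base as.Date() instead of lubridate, leaves out the Value column because it plays no role in the match, and the names via_join, via_key and existing_keys are made up for illustration.

```r
library(dplyr)

# Existing ID/Date combinations (Value column omitted: it is not part of the key)
my_df <- data.frame(
  ID   = c("A", "B", "C", "A", "B"),
  Date = seq(as.Date("2022-01-01"), as.Date("2022-01-05"), by = 1)
)

# The same five combinations plus the four new ones from the question
new_df <- bind_rows(
  my_df,
  data.frame(
    ID   = c("A", "B", "D", "D"),
    Date = as.Date(c("2022-01-02", "2022-01-01", "2022-01-01", "2022-01-07"))
  )
)

# Rows of new_df whose (ID, Date) pair has no match in my_df
via_join <- anti_join(new_df, my_df, by = c("ID", "Date"))

# The concatenated-key filter from the question, written in base R for comparison
existing_keys <- paste0(my_df$ID, my_df$Date)
via_key <- new_df[!(paste0(new_df$ID, new_df$Date) %in% existing_keys), ]

# Both approaches should keep the same rows in the same order
stopifnot(
  identical(via_join$ID, via_key$ID),
  all(via_join$Date == via_key$Date)
)
```

Naming both columns in by makes the join compare the pair as a whole, so no helper column has to be created and dropped again.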