Search code examples
rdataframedplyrtidyverse

Filtering one dataset using the Observations in another dataset


I have two datasets. The first dataset looks like -

data.frame (year = rep(2005:2022,2), CompanyID = rep(6:7, each = 18))
year        CompanyID
1  2005         6
2  2006         6
3  2007         6
4  2008         6
5  2009         6
6  2010         6
7  2011         6
8  2012         6
9  2013         6
10 2014         6
11 2015         6
12 2016         6
13 2017         6
14 2018         6
15 2019         6
16 2020         6
17 2021         6
18 2022         6
19 2005         7
20 2006         7
21 2007         7
22 2008         7
23 2009         7
24 2010         7
25 2011         7
26 2012         7
27 2013         7
28 2014         7
29 2015         7
30 2016         7
31 2017         7
32 2018         7
33 2019         7
34 2020         7
35 2021         7
36 2022         7

The second data set looks like this -

data.frame (year = c(2012, 2007, 2015, 2009), CompanyID = c(6,6,7,7))
   year     CompanyID
1 2012         6
2 2007         6
3 2015         7
4 2009         7

Now I want the following -

In the first dataset, I want to keep the observations one year before (t-1) and one year after (t+1) of the year variable from second dataset as the CompanyID in first dataset and second dataset are the same. So my expected output should look like this -

   year CompanyID POST
1 2006         6    0
2 2008         6    1
3 2011         6    0
4 2013         6    1
5 2008         7    0
6 2010         7    1
7 2014         7    0
8 2016         7    1

The POST variable is also needed to identify the observations before year (t-1) or after year (t+1)

I have many observations for many CompanyID, but here I show data only for two CompanyID, which are 6 and 7.


Solution

  • First modify your second dataset (df_ref in my code) to include one year before and after the target, also a POST column specifying if it's before or after. Then join the two together.

    library(dplyr)
    
    df_data <- data.frame (year = rep(2005:2022,2), CompanyID = rep(6:7, each = 18))
    df_ref <- data.frame (year = c(2012, 2007, 2015, 2009), CompanyID = c(6,6,7,7))
    
    df_ref |> 
      reframe(year = c(year - 1, year + 1), POST = c(0, 1), .by = c(year, CompanyID)) |> 
      left_join(df_data, by = c("year", "CompanyID"))
    
      year CompanyID POST
    1 2011         6    0
    2 2013         6    1
    3 2006         6    0
    4 2008         6    1
    5 2014         7    0
    6 2016         7    1
    7 2008         7    0
    8 2010         7    1