Search code examples
rdplyrtime-seriesdummy-variable

dplyr: Filter within years prior to marked time period


I want have a list of year-country specific dummies and I want to also mark years two years prior to those years that are marked.

The data looks like this

library(tidyverse)

df <- tribble(
  ~year, ~country, ~occurrence, 
  #--|--|----
  2003, "USA", 1,
  2004, "USA", 0,
  2005, "USA", 0,
  2006, "USA", 0,
  2007, "USA", 0,
  2008, "USA", 0,
  2009, "USA", 0,
  2010, "USA", 0,
  2011, "USA", 1,
  2012, "USA", 0,
  2013, "USA", 0,
  2005, "FRA", 0,
  2006, "FRA", 0,
  2007, "FRA", 1,
  2008, "FRA", 1,
  2009, "FRA", 0,
  2010, "FRA", 0,
  2011, "FRA", 0,
  2012, "FRA", 0,
  2013, "FRA", 0,
  2014, "FRA", 0,
  2015, "FRA", 1
)

So for "USA" I also want to put a 1 into column occurence for the years 2009 and 2010 and for FRA the years 2005, 2006, 2013 and 2014.

I thought about doing something like this:

df %>%
  group_by(country) %>%
  mutate(occurence = ifelse("not sure what to put here"),
                            1,
                            0))

But I'm not sure how to tell R only to filter for the years I want.


Solution

  • Here is another dplyr solution:

    df %>% 
      group_by(country) %>% 
      mutate(
            occurrence=ifelse( lead(occurrence, 1) %in% 1 | 
                               lead(occurrence, 2) %in% 1, 
                               1, occurrence)
             )
    
    # A tibble: 22 x 3
    # Groups:   country [2]
        year country occurrence
       <dbl>   <chr>      <dbl>
     1  2003     USA          1
     2  2004     USA          0
     3  2005     USA          0
     4  2006     USA          0
     5  2007     USA          0
     6  2008     USA          0
     7  2009     USA          1
     8  2010     USA          1
     9  2011     USA          1
    10  2012     USA          0
    11  2013     USA          0
    12  2005     FRA          1
    13  2006     FRA          1
    14  2007     FRA          1
    15  2008     FRA          1
    16  2009     FRA          0
    17  2010     FRA          0
    18  2011     FRA          0
    19  2012     FRA          0
    20  2013     FRA          1
    21  2014     FRA          1
    22  2015     FRA          1
    

    lead(occurrence, 1) %in% 1 is used instead of lead(occurrence, 1) == 1 because the latter cannot handle NA.