Search code examples
rdata-manipulation

R: Find Out First Non-Consecutive Year


I have this dataset in R:

name = c("john", "john", "john", "john", "john", "alex", "alex", "alex", "alex", "alex" )
year = c(2010, 2011, 2012, 2015, 2017, 2014, 2015, 2016, 2017, 2018)
my_data = data.frame(name, year)

> my_data
   name year
1  john 2010
2  john 2011
3  john 2012
4  john 2015
5  john 2017
6  alex 2014
7  alex 2015
8  alex 2016
9  alex 2017
10 alex 2018

For each person in this dataset - I want to select all rows until the first non-consecutive year appears.

As an example, the final output would look something like this:

# DESIRED OUTPUT
> my_data
   name year
1  john 2010
2  john 2011
3  john 2012
6  alex 2014
7  alex 2015
8  alex 2016
9  alex 2017
10 alex 2018

To do this, I thought of the following approach:

> agg <- aggregate(year ~ name, my_data, c)
> agg
  name year.1 year.2 year.3 year.4 year.5
1 alex   2014   2015   2016   2017   2018
2 john   2010   2011   2012   2015   2017

library(stringr)
agg = data.frame(as.matrix(agg))
agg$all_years = paste(agg$year.1 ,agg$year.2,  agg$year.3, agg$year.4, agg$year.5)

agg$y_2010 = str_detect(agg$all_years, "2010")
agg$y_2011 = str_detect(agg$all_years, "2011")
agg$y_2012 = str_detect(agg$all_years, "2012")
agg$y_2013 = str_detect(agg$all_years, "2013")
agg$y_2014 = str_detect(agg$all_years, "2014")
agg$y_2015 = str_detect(agg$all_years, "2015")
agg$y_2016 = str_detect(agg$all_years, "2016")
agg$y_2017 = str_detect(agg$all_years, "2017")
agg$y_2018 = str_detect(agg$all_years, "2018")
agg$y_2019 = str_detect(agg$all_years, "2019")

  name year.1 year.2 year.3 year.4 year.5                all_years y_2010 y_2011 y_2012 y_2013 y_2014 y_2015 y_2016 y_2017 y_2018 y_2019
1 alex   2014   2015   2016   2017   2018 2014 2015 2016 2017 2018  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE   TRUE   TRUE   TRUE  FALSE
2 john   2010   2011   2012   2015   2017 2010 2011 2012 2015 2017   TRUE   TRUE   TRUE  FALSE  FALSE   TRUE  FALSE   TRUE  FALSE  FALSE

Now, the idea would be for each row - find out the first time when a "TRUE" is followed by a "FALSE" - and then I would try to find some way to accomplish my task.

But I am not sure how to proceed from here.

Can someone please show me how to do this?

Thanks!


Solution

  • library(dplyr)
    
    my_data %>% 
      group_by(name) %>% 
      filter(c(1,diff(year)) == 1)
    
    # A tibble: 8 x 2
    # Groups:   name [2]
      name   year
      <chr> <dbl>
    1 john   2010
    2 john   2011
    3 john   2012
    4 alex   2014
    5 alex   2015
    6 alex   2016
    7 alex   2017
    8 alex   2018