Search code examples
rdatetidyversedata-cleaningpanel-data

Datacleaning: Keeping data that has at least n successive observation


Hey guys I need some help with cleaning my data, I am still quite new with R, so please go easy on me.

I have posted a sample of the data below.

  • It is a panel dataset, that looks at the a company's price over time (the value column)
  • The name of the company is the cik column

The problem is that I don't have complete data for every single company over the entire period, some companies don't have any data at all during the entire period and as a result their value will be NA for all dates, however some will have data over a certain period such as from 2004-2006 however won't have data for the entire sample period. I want to keep the data when there is consecutive data for 24 months straight or more.

For example if the company has data from 2005/01/01 to 2008/01/01 I want to keep this data. However if the company has data from 2005/01/01 to 2006/01/01 I want to exclude this data. In the data there will also be some companies which has break for example a company may have data from 2005/01/01 to 2006/01/01 and also 2008/01/01 to 2015/01/01, in which case I want to keep the data from 2008/01/01 to 2015/01/01 but exclude the data from 2005/01/01 to 2006/01/01.

cik     value   date
<dbl>   <dbl>   <date>
1001039 41.500  2004-10-31      
1001039 26.000  2004-11-30      
1001039 24.500  2004-12-31      
1001039 31.250  2005-01-31      
1001039 24.090  2005-02-28      
1001039 30.250  2005-03-31      
1001039 35.340  2005-04-30      
1001039 38.500  2005-05-31      
1001039 30.000  2005-06-30      
1001039 27.250  2005-07-31      
1001039 34.500  2005-08-31      
1001039 37.500  2005-09-30      
1001039 43.750  2005-10-31      
1001039 40.120  2005-11-30      
1001039 33.250  2005-12-31      
1001039 29.000  2006-01-31      
1001039 23.500  2006-02-28      
1001039 21.500  2006-03-31      
1001039 17.840  2006-04-30      
1001039 21.000  2006-05-31      
1001039 18.060  2006-06-30      
1001039 19.160  2006-07-31      
1001039 18.500  2006-08-31      
1001039 17.750  2006-09-30      
1001039 15.500  2006-10-31      
1001039 12.000  2006-11-30      
1001039 11.500  2006-12-31      
1001039 9.915   2007-01-31          
104169  5.700   2007-05-31      
104169  10.000  2007-06-30      
104169  21.000  2007-07-31      
104169  22.500  2007-08-31      
104169  18.500  2007-09-30      
104169  22.500  2007-10-31      
104169  28.000  2007-11-30      
104169  26.000  2007-12-31      
104169  35.500  2008-01-31      
104169  57.200  2008-02-29      
104169  42.300  2008-03-31      
104169  30.800  2008-04-30      
104169  35.500  2008-05-31  
1001039 34.980  2017-10-31      
1001039 33.360  2017-11-30      
1001039 32.970  2017-12-31      
1001039 28.685  2018-01-31      
1001039 28.200  2018-02-28      
1001039 35.855  2018-03-31      
1001039 29.755  2018-04-30      
1001039 34.145  2018-05-31      
1001039 48.275  2018-06-30      
1001039 37.310  2018-07-31      
1001039 34.600  2018-08-31      
1001039 36.980  2018-09-30      
1001039 34.615  2018-10-31      
1001039 35.195  2018-11-30      
1001039 40.920  2018-12-31      
1001039 29.050  2019-01-31      
1001039 21.005  2019-02-28      
1001039 18.420  2019-03-31      
1001039 16.980  2019-04-30      
1001039 21.280  2019-05-31      
1001039 17.235  2019-06-30      
1001039 15.700  2019-07-31      
1001039 15.990  2019-08-31      
1001039 19.525  2019-09-30      
1001039 19.500  2019-10-31      
1001039 16.885  2019-11-30      
1001039 15.940  2019-12-31      
1001039 18.770  2020-01-31      
1001039 28.650  2020-02-29      
1001039 59.405  2020-03-31      
1001039 36.265  2020-04-30      
1001039 26.345  2020-05-31      
1001039 23.960  2020-06-30      
1001039 22.195  2020-07-31      
100885  40.010  2004-10-31      
100885  26.500  2004-11-30      
100885  30.000  2004-12-31
100885  70.200  2008-03-31      
100885  40.300  2008-04-30      
100885  51.500  2008-05-31      
100885  72.500  2008-06-30      
100885  55.600  2008-07-31      
100885  66.000  2008-08-31      
100885  54.200  2008-09-30      
100885  85.500  2008-10-31      
100885  140.000 2008-11-30      
100885  100.000 2008-12-31      
100885  104.100 2009-01-31      
100885  101.090 2009-02-28      
100885  86.960  2009-03-31      
100885  63.850  2009-04-30      
100885  50.500  2009-05-31      
100885  57.160  2009-06-30      
100885  38.000  2009-07-31      
100885  48.500  2009-08-31      
100885  44.250  2009-09-30      
100885  59.790  2009-10-31      
100885  70.130  2009-11-30      
100885  45.420  2009-12-31      
100885  59.290  2010-01-31      
100885  58.980  2010-02-28      
100885  58.500  2010-03-31      
100885  52.250  2010-04-30      
100885  54.360  2010-05-31      
100885  50.500  2010-06-30      
100885  41.870  2010-07-31      

Thank so much everyone!


Solution

  • You can create groups where the monthly sequence breaks (difference between consecutive date is greater than 31 days) and select only those groups which has atleast 24 enteries (2 years).

    library(dplyr)
    df %>%
      arrange(cik, date) %>%
      group_by(cik, grp = cumsum(c(TRUE, diff(date)) > 31)) %>%
      filter(n() >= 24)