Hey guys I need some help with cleaning my data, I am still quite new with R, so please go easy on me.
I have posted a sample of the data below.
The problem is that I don't have complete data for every single company over the entire period, some companies don't have any data at all during the entire period and as a result their value will be NA for all dates, however some will have data over a certain period such as from 2004-2006 however won't have data for the entire sample period. I want to keep the data when there is consecutive data for 24 months straight or more.
For example if the company has data from 2005/01/01 to 2008/01/01 I want to keep this data. However if the company has data from 2005/01/01 to 2006/01/01 I want to exclude this data. In the data there will also be some companies which has break for example a company may have data from 2005/01/01 to 2006/01/01 and also 2008/01/01 to 2015/01/01, in which case I want to keep the data from 2008/01/01 to 2015/01/01 but exclude the data from 2005/01/01 to 2006/01/01.
cik value date
<dbl> <dbl> <date>
1001039 41.500 2004-10-31
1001039 26.000 2004-11-30
1001039 24.500 2004-12-31
1001039 31.250 2005-01-31
1001039 24.090 2005-02-28
1001039 30.250 2005-03-31
1001039 35.340 2005-04-30
1001039 38.500 2005-05-31
1001039 30.000 2005-06-30
1001039 27.250 2005-07-31
1001039 34.500 2005-08-31
1001039 37.500 2005-09-30
1001039 43.750 2005-10-31
1001039 40.120 2005-11-30
1001039 33.250 2005-12-31
1001039 29.000 2006-01-31
1001039 23.500 2006-02-28
1001039 21.500 2006-03-31
1001039 17.840 2006-04-30
1001039 21.000 2006-05-31
1001039 18.060 2006-06-30
1001039 19.160 2006-07-31
1001039 18.500 2006-08-31
1001039 17.750 2006-09-30
1001039 15.500 2006-10-31
1001039 12.000 2006-11-30
1001039 11.500 2006-12-31
1001039 9.915 2007-01-31
104169 5.700 2007-05-31
104169 10.000 2007-06-30
104169 21.000 2007-07-31
104169 22.500 2007-08-31
104169 18.500 2007-09-30
104169 22.500 2007-10-31
104169 28.000 2007-11-30
104169 26.000 2007-12-31
104169 35.500 2008-01-31
104169 57.200 2008-02-29
104169 42.300 2008-03-31
104169 30.800 2008-04-30
104169 35.500 2008-05-31
1001039 34.980 2017-10-31
1001039 33.360 2017-11-30
1001039 32.970 2017-12-31
1001039 28.685 2018-01-31
1001039 28.200 2018-02-28
1001039 35.855 2018-03-31
1001039 29.755 2018-04-30
1001039 34.145 2018-05-31
1001039 48.275 2018-06-30
1001039 37.310 2018-07-31
1001039 34.600 2018-08-31
1001039 36.980 2018-09-30
1001039 34.615 2018-10-31
1001039 35.195 2018-11-30
1001039 40.920 2018-12-31
1001039 29.050 2019-01-31
1001039 21.005 2019-02-28
1001039 18.420 2019-03-31
1001039 16.980 2019-04-30
1001039 21.280 2019-05-31
1001039 17.235 2019-06-30
1001039 15.700 2019-07-31
1001039 15.990 2019-08-31
1001039 19.525 2019-09-30
1001039 19.500 2019-10-31
1001039 16.885 2019-11-30
1001039 15.940 2019-12-31
1001039 18.770 2020-01-31
1001039 28.650 2020-02-29
1001039 59.405 2020-03-31
1001039 36.265 2020-04-30
1001039 26.345 2020-05-31
1001039 23.960 2020-06-30
1001039 22.195 2020-07-31
100885 40.010 2004-10-31
100885 26.500 2004-11-30
100885 30.000 2004-12-31
100885 70.200 2008-03-31
100885 40.300 2008-04-30
100885 51.500 2008-05-31
100885 72.500 2008-06-30
100885 55.600 2008-07-31
100885 66.000 2008-08-31
100885 54.200 2008-09-30
100885 85.500 2008-10-31
100885 140.000 2008-11-30
100885 100.000 2008-12-31
100885 104.100 2009-01-31
100885 101.090 2009-02-28
100885 86.960 2009-03-31
100885 63.850 2009-04-30
100885 50.500 2009-05-31
100885 57.160 2009-06-30
100885 38.000 2009-07-31
100885 48.500 2009-08-31
100885 44.250 2009-09-30
100885 59.790 2009-10-31
100885 70.130 2009-11-30
100885 45.420 2009-12-31
100885 59.290 2010-01-31
100885 58.980 2010-02-28
100885 58.500 2010-03-31
100885 52.250 2010-04-30
100885 54.360 2010-05-31
100885 50.500 2010-06-30
100885 41.870 2010-07-31
Thank so much everyone!
You can create groups where the monthly sequence breaks (difference between consecutive date
is greater than 31 days) and select only those groups which has atleast 24 enteries (2 years).
library(dplyr)
df %>%
arrange(cik, date) %>%
group_by(cik, grp = cumsum(c(TRUE, diff(date)) > 31)) %>%
filter(n() >= 24)