Remove days based on number of hours missing

I have some air pollution data measured hourly.

Datetime	PM2.5	Station.id
2020-01-01 00:00:00	10	1
2020-01-01 01:00:00	NA	1
2020-01-01 02:00:00	15	1
2020-01-01 03:00:00	NA	1
2020-01-01 04:00:00	7	1
2020-01-01 05:00:00	20	1
2020-01-01 06:00:00	30	1
2020-01-01 00:00:00	NA	2
2020-01-01 01:00:00	17	2
2020-01-01 02:00:00	21	2
2020-01-01 03:00:00	55	2

I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it has both (1) a total of 18 hours of missing data AND (2) 8 hours of continuous missing data?

PS. In the original data, the missing hours may either have been removed already or be present as NAs.

Solution

The "most efficient" way will almost certainly use data.table. Something like this:

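library(data.table)
setDT(your_data)
# Add a calendar-date column, then keep a (date, station) group unless it has
# both >= 18 missing hours and a run of >= 8 consecutive missing hours
your_data[, date := as.IDate(Datetime)][,
  if (
    !(sum(is.na(PM2.5)) >= 18 &&
      # longest run of consecutive NAs; the 0L guards days without any NA
      with(rle(is.na(PM2.5)), max(lengths[values], 0L)) >= 8)
  ) .SD,
  by = .(date, Station.id)
]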
#          date            Datetime PM2.5
# 1: 2020-01-01 2020-01-01 00:00:00    10
# 2: 2020-01-01 2020-01-01 01:00:00    NA
# 3: 2020-01-01 2020-01-01 02:00:00    15
# 4: 2020-01-01 2020-01-01 03:00:00    NA
# 5: 2020-01-01 2020-01-01 04:00:00     7
# 6: 2020-01-01 2020-01-01 05:00:00    20
# 7: 2020-01-01 2020-01-01 06:00:00    30

Using this sample data (a single station with no Station.id column, so group by date alone to reproduce the output above):

your_data = fread(text = 'Datetime  PM2.5
2020-01-01 00:00:00 10
2020-01-01 01:00:00 NA
2020-01-01 02:00:00 15
2020-01-01 03:00:00 NA
2020-01-01 04:00:00 7
2020-01-01 05:00:00 20
2020-01-01 06:00:00 30')
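
If the NA rows have already been removed (the first case in the PS), the check above never sees the missing hours, because it only counts rows that exist. One possible workaround is to rebuild the full hourly grid per station first and join the observations onto it, so the dropped hours come back as NA. This is only a sketch under some assumptions: timestamps fall exactly on the hour, everything is in UTC (no daylight-saving gaps), and the names obs, grid, full and kept, as well as the PM2.5 values, are made up for illustration.

```r
library(data.table)

# Hypothetical data: on 2020-01-01 only 3 hourly rows survive (00:00, 09:00,
# 18:00); on 2020-01-02 all 24 hours are present.
all_hours <- seq(as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 48)
obs <- data.table(
  Datetime   = all_hours[c(1, 10, 19, 25:48)],
  PM2.5      = c(10, 15, 7, rep(12, 24)),
  Station.id = 1L
)

# Full hourly grid from the first to the last observed day, for every station
day_start <- as.POSIXct(trunc(min(obs$Datetime), "days"))
day_end   <- as.POSIXct(trunc(max(obs$Datetime), "days")) + 23 * 3600
grid <- obs[, .(Datetime = seq(day_start, day_end, by = "hour")),
            by = Station.id]

# Join observations onto the grid: hours without a row get PM2.5 = NA
full <- obs[grid, on = .(Station.id, Datetime)]
full[, date := as.IDate(Datetime)]

# Same rule as before: drop a (date, station) group only when it has both
# >= 18 missing hours and a run of >= 8 consecutive missing hours
kept <- full[,
  if (!(sum(is.na(PM2.5)) >= 18 &&
        with(rle(is.na(PM2.5)), max(lengths[values], 0L)) >= 8)) .SD,
  by = .(date, Station.id)
]
```

In this made-up example, 2020-01-01 has 21 missing hours with a longest gap of 8, so it is dropped and only the 24 rows of 2020-01-02 remain. If you do not want the re-inserted NA rows in the result, filter them out afterwards with kept[!is.na(PM2.5)].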