I have some air pollution data measured hourly.
Datetime | PM2.5 | Station.id |
---|---|---|
2020-01-01 00:00:00 | 10 | 1 |
2020-01-01 01:00:00 | NA | 1 |
2020-01-01 02:00:00 | 15 | 1 |
2020-01-01 03:00:00 | NA | 1 |
2020-01-01 04:00:00 | 7 | 1 |
2020-01-01 05:00:00 | 20 | 1 |
2020-01-01 06:00:00 | 30 | 1 |
2020-01-01 00:00:00 | NA | 2 |
2020-01-01 01:00:00 | 17 | 2 |
2020-01-01 02:00:00 | 21 | 2 |
2020-01-01 03:00:00 | 55 | 2 |
I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it meets both of these conditions: (1) a total of at least 18 hours of missing data, AND (2) at least 8 consecutive hours of missing data?
PS. In the original data, the missing hours may either have been removed already (the rows are absent) OR be present as rows with NA.
The "most efficient" way will almost certainly use data.table. Something like this:
library(data.table)
setDT(your_data)
# Add a calendar-date column, then keep a station-day's rows (.SD) only when
# it does NOT meet both removal conditions
your_data[, date := as.IDate(Datetime)][,
  if (!(
    sum(is.na(PM2.5)) >= 18 &&                               # 18+ missing hours in total
    with(rle(is.na(PM2.5)), max(0L, lengths[values])) >= 8   # longest NA run is 8+ hours
  )) .SD,
  by = .(date, Station.id)
]
#          date Station.id            Datetime PM2.5
# 1: 2020-01-01          1 2020-01-01 00:00:00    10
# 2: 2020-01-01          1 2020-01-01 01:00:00    NA
# 3: 2020-01-01          1 2020-01-01 02:00:00    15
# 4: 2020-01-01          1 2020-01-01 03:00:00    NA
# 5: 2020-01-01          1 2020-01-01 04:00:00     7
# 6: 2020-01-01          1 2020-01-01 05:00:00    20
# 7: 2020-01-01          1 2020-01-01 06:00:00    30
Using this sample data:
your_data = fread(text = 'Datetime,PM2.5,Station.id
2020-01-01 00:00:00,10,1
2020-01-01 01:00:00,NA,1
2020-01-01 02:00:00,15,1
2020-01-01 03:00:00,NA,1
2020-01-01 04:00:00,7,1
2020-01-01 05:00:00,20,1
2020-01-01 06:00:00,30,1')
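The check above only counts explicit NA rows. If the missing hours have already been dropped (the other case mentioned in the question's PS), one possible approach is to rebuild the full hourly grid per station first and join the observations onto it, so the gaps become NA again. This is only a sketch: the `obs`, `grid`, `full` and `kept` names and the toy data are made up for illustration, and it assumes timestamps are POSIXct in UTC on exact hours.

```r
library(data.table)

# Hypothetical data: station 1 has only 3 hourly rows left on 2020-01-01
# (the NA rows were dropped upstream) and a complete 2020-01-02.
obs <- rbind(
  data.table(
    Datetime = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 12:00:00",
                            "2020-01-01 23:00:00"), tz = "UTC"),
    PM2.5 = c(10, 15, 7),
    Station.id = 1L
  ),
  data.table(
    Datetime = seq(as.POSIXct("2020-01-02 00:00:00", tz = "UTC"),
                   by = "hour", length.out = 24),
    PM2.5 = 20,
    Station.id = 1L
  )
)

# Full hourly grid per station, from 00:00 of the first observed day
# to 23:00 of the last observed day.
grid <- obs[, {
  first_day <- as.POSIXct(format(min(Datetime), "%Y-%m-%d"), tz = "UTC")
  last_day  <- as.POSIXct(format(max(Datetime), "%Y-%m-%d"), tz = "UTC")
  .(Datetime = seq(first_day, last_day + 23 * 3600, by = "hour"))
}, by = Station.id]

# Join observations onto the grid: hours with no row in obs get PM2.5 = NA.
full <- obs[grid, on = .(Station.id, Datetime)]

# Same filter as above, now applied to the completed data.
kept <- full[, date := as.IDate(Datetime)][,
  if (!(sum(is.na(PM2.5)) >= 18 &&
        with(rle(is.na(PM2.5)), max(0L, lengths[values])) >= 8)) .SD,
  by = .(date, Station.id)
]
```

In this toy case 2020-01-01 ends up with 21 missing hours and an 11-hour gap, so it is dropped, while 2020-01-02 is kept. Whether the extra join is affordable on very large data depends on how many stations and days are involved.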