r, data-cleaning

Remove days based on number of hours missing


I have hourly air pollution measurements:

Datetime PM2.5 Station.id
2020-01-01 00:00:00 10 1
2020-01-01 01:00:00 NA 1
2020-01-01 02:00:00 15 1
2020-01-01 03:00:00 NA 1
2020-01-01 04:00:00 7 1
2020-01-01 05:00:00 20 1
2020-01-01 06:00:00 30 1
2020-01-01 00:00:00 NA 2
2020-01-01 01:00:00 17 2
2020-01-01 02:00:00 21 2
2020-01-01 03:00:00 55 2

I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it has both (1) a total of at least 18 hours of missing data AND (2) at least 8 consecutive hours of missing data?

PS. In the original data, the missing hours may either have been removed already or be present as NAs.


Solution

  • The fastest approach will almost certainly use data.table. Something like this:

    library(data.table)
    setDT(your_data)
    # rle() below relies on the hours being in chronological order within each station
    setorder(your_data, Station.id, Datetime)
    your_data[, date := as.IDate(Datetime)]
    your_data[,
      # keep a station-day unless it meets both removal criteria
      if (!(
        sum(is.na(PM2.5)) >= 18 &&
        # longest run of consecutive NAs; the extra 0 covers days without any NA
        with(rle(is.na(PM2.5)), max(lengths[values], 0)) >= 8
      )) .SD,
      by = .(date, Station.id)
    ]
    # No station-day in the sample data below meets both criteria, so all 11 rows
    # are returned, with columns date, Station.id, Datetime and PM2.5.

    Using this sample data:

    your_data = fread(text = 'Datetime,PM2.5,Station.id
    2020-01-01 00:00:00,10,1
    2020-01-01 01:00:00,NA,1
    2020-01-01 02:00:00,15,1
    2020-01-01 03:00:00,NA,1
    2020-01-01 04:00:00,7,1
    2020-01-01 05:00:00,20,1
    2020-01-01 06:00:00,30,1
    2020-01-01 00:00:00,NA,2
    2020-01-01 01:00:00,17,2
    2020-01-01 02:00:00,21,2
    2020-01-01 03:00:00,55,2')
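
    The question's PS mentions that the missing hours may already have been dropped rather than stored as NA. Counting with is.na() only sees hours that exist as rows, so in that case the rows would need to be restored first. A minimal sketch of one way to do that, assuming timestamps are in UTC and fall exactly on the hour; the names obs, day_start, grid and filled are made up for this illustration and are not part of the original answer:

```r
library(data.table)

# Hypothetical input: hours without a reading were dropped instead of stored as NA
obs <- data.table(
  Datetime   = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 02:00:00",
                            "2020-01-01 23:00:00"), tz = "UTC"),
  PM2.5      = c(10, 15, 7),
  Station.id = 1L
)

# Midnight (UTC) of the day containing x, returned as POSIXct
day_start <- function(x) {
  as.POSIXct(floor(as.numeric(x) / 86400) * 86400,
             origin = "1970-01-01", tz = "UTC")
}

# One row per station and hour, spanning every day between
# that station's first and last reading
grid <- obs[, .(Datetime = seq(day_start(min(Datetime)),
                               day_start(max(Datetime)) + 23 * 3600,
                               by = "hour")),
            by = Station.id]

# Join the readings onto the grid; hours with no reading get PM2.5 = NA
filled <- obs[grid, on = .(Station.id, Datetime)]
```

    Here filled has 24 rows for the single station-day, 21 of them with PM2.5 = NA, and it can then be grouped by date and Station.id and filtered as usual. Note that the grid only covers days between each station's first and last reading, so days missing entirely at either end of a station's record are not recreated.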