Search code examples
rdplyrfiltertime-seriesidentify

Keep only the hours for which the value have not changed within the hour R


I have a timeseries dataset with 'n' number of columns. In the dataset, I would like to filter and remove the hours for which the value in a column changed within the hour. In other words, I want to keep the hours that has unchanged value.

Some info about the data:

  • 1-minute data with date in POSCIXT format (%Y-%m-%d %H:%M:%S)

  • The data is recorded for every 5 minutes, so for the time, for which the data is not recorded is filled with NA

    data1<- structure(list(PDT = structure(c(1668927600, 1668927660, 
       1668927720, 
      1668927780, 1668927840, 1668927900, 1668927960, 1668928020, 1668928080, 
      1668928140, 1668928200, 1668928260, 1668928320, 1668928380, 1668928440, 
      1668928500, 1668928560, 1668928620, 1668928680, 1668928740, 1668928800, 
      1668928860, 1668928920, 1668928980, 1668929040, 1668929100, 1668929160, 
      1668929220, 1668929280, 1668929340, 1668929400, 1668929460, 1668929520, 
      1668929580, 1668929640, 1668929700, 1668929760, 1668929820, 1668929880, 
      1668929940, 1668930000, 1668930060, 1668930120, 1668930180, 1668930240, 
      1668930300, 1668932640, 1668932700, 1668932760, 1668932820, 1668932880, 
      1668932940, 1668933000, 1668933060, 1668933120, 1668933180, 1668933240, 
      1668933300, 1668933360, 1668933720, 1668933780, 1668933840, 1668933900, 
      1668933960, 1668934020, 1668934080, 1668934140, 1668934200, 1668934260, 
      1668934320, 1668934380, 1668934440, 1668934500, 1668934560, 1668934620, 
      1668934680, 1668934740, 1668934800), class = c("POSIXct", "POSIXt"
      ), tzone = "UTC"), Date = structure(c(1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400), class = c("POSIXct", 
      "POSIXt"), tzone = "UTC"), DayOfWeek = c("Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday"), 
          Month = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11), Day = c(20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20), Hour_hr = c(7, 
          7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
          7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
          7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
          8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
          9), Minute = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
          13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 
          28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 
          43, 44, 45, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
          36, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
          56, 57, 58, 59, 0), ColA = c(NA, NA, NA, 65.5, NA, NA, NA, 
          NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, 
          NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, 
          NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 
          NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, 
          NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 
          65.7, NA, NA)), row.names = 3041:3118, class = "data.frame")
    

Expected output:

In the above example, I want to exclude hour 8 from my dataset, as the value in ColA is not constant.

I have a feeling that group_by() and filter() from dplyr might do the job, but I am not sure about the function to find the unchanged values within an hour.

Any help regarding this is much appreciated. Thanks.


Solution

  • This does it:

    data1 %>% group_by(Hour_hr)  %>% filter(n_distinct(ColA) < 3)
    

    Checking results:

    count(data1, Hour_hr)
    
      Hour_hr     n
        <dbl> <int>
    1       7    46
    2       9     1
    

    This will keep colA if there's only one numerical value or no numerical values (NA), keeping hour 7 and 9.

    Equivalently you could do:

    data1 %>% group_by(Hour_hr)  %>% filter(n_distinct(ColA, na.rm = T) < 2)