Search code examples
rrun-length-encoding

Min and max of NA sequences


I have a data frame in which the column foo contains running sequences of NA values. For example:

> test
   id  foo                time
1   1 <NA> 2018-11-19 00:00:48
2   1 <NA> 2018-11-19 00:10:51
3   1 <NA> 2018-11-19 00:21:15
4   1 <NA> 2018-11-19 00:31:02
5   1    x 2018-11-19 00:40:59
6   1    x 2018-11-19 00:50:49
7   1    x 2018-11-19 01:01:15
8   1 <NA> 2018-11-19 01:11:07
9   1 <NA> 2018-11-19 01:20:49
10  2 <NA> 2018-11-19 01:30:50
11  2 <NA> 2018-11-19 01:40:43
12  2    x 2018-11-19 01:50:46
13  2    x 2018-11-19 02:01:02
14  2    x 2018-11-19 02:10:44
15  2 <NA> 2018-11-19 02:20:51
16  2 <NA> 2018-11-19 02:31:06
17  2 <NA> 2018-11-19 02:40:42
18  2 <NA> 2018-11-19 02:50:45
19  3 <NA> 2018-11-19 03:01:00
20  3 <NA> 2018-11-19 03:10:42
21  3 <NA> 2018-11-19 03:21:10
22  3 <NA> 2018-11-19 03:31:10
23  3    x 2018-11-19 03:40:44
24  3 <NA> 2018-11-19 03:50:46
25  3 <NA> 2018-11-19 04:00:46

My objective is to mark where each sequence begins by id and time for example - the above dataset would have an extra column called index which marks where the starts and ends of these NA values are. However, the last NA in the id series should be ignored, and a single NA value would be marked as "both". For example:

> test
   id  foo                time     index
1   1 <NA> 2018-11-19 00:00:48 na_starts
2   1 <NA> 2018-11-19 00:10:51          
3   1 <NA> 2018-11-19 00:21:15          
4   1 <NA> 2018-11-19 00:31:02   na_ends
5   1    x 2018-11-19 00:40:59          
6   1    x 2018-11-19 00:50:49          
7   1    x 2018-11-19 01:01:15          
8   1 <NA> 2018-11-19 01:11:07 na_starts
9   1 <NA> 2018-11-19 01:20:49          
10  2 <NA> 2018-11-19 01:30:50 na_starts
11  2 <NA> 2018-11-19 01:40:43   na_ends
12  2    x 2018-11-19 01:50:46          
13  2    x 2018-11-19 02:01:02          
14  2    x 2018-11-19 02:10:44          
15  2 <NA> 2018-11-19 02:20:51 na_starts
16  2 <NA> 2018-11-19 02:31:06          
17  2 <NA> 2018-11-19 02:40:42          
18  2 <NA> 2018-11-19 02:50:45          
19  3 <NA> 2018-11-19 03:01:00          
20  3 <NA> 2018-11-19 03:10:42 na_starts
21  3 <NA> 2018-11-19 03:21:10          
22  3 <NA> 2018-11-19 03:31:10   na_ends
23  3    x 2018-11-19 03:40:44          
24  3 <NA> 2018-11-19 03:50:46      both
25  3    x 2018-11-19 04:00:46   

How would one achieve this with rle or a similar function in R?

 dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), foo = c(NA, NA, NA, NA, 
"x", "x", "x", NA, NA, NA, NA, "x", "x", "x", NA, NA, NA, NA, 
NA, NA, NA, NA, "x", NA, "x"), time = structure(c(1542585648, 
1542586251, 1542586875, 1542587462, 1542588059, 1542588649, 1542589275, 
1542589867, 1542590449, 1542591050, 1542591643, 1542592246, 1542592862, 
1542593444, 1542594051, 1542594666, 1542595242, 1542595845, 1542596460, 
1542597042, 1542597670, 1542598270, 1542598844, 1542599446, 1542600046
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-25L), class = "data.frame")

Solution

  • Maybe this will work? I'm not entirely sure what relationship time has to the problem other than I think you wanted it sorted by id and time.

    library("tidyverse")                                                                                                                                            -25L), class = "data.frame")
    test = test %>% 
      arrange(id, time) %>% 
      mutate(miss = is.na(foo))
    
    # This will make the index column for a single run
    mark_ends = function(n, miss){
      if(!miss){
        rep("", times = n)
      }
      else{
        if(n == 1){"both"}
        else(c("na_starts", rep("", times = (n-2)), "na_ends"))}
    }
    
    # This will use mark_ends across a single ID
    mark_index = function(id){
       runs = test$miss[test$id == id] %>% 
         rle
      result = Map(f = mark_ends, n = runs$lengths, miss = runs$values) %>% 
        reduce(.f = c)
      result[length(result)] = ""
      result
    }
    
    # use the function on each id, combine, and put it in test
    test$index = unique(test$id) %>% 
      map(mark_index) %>% 
      reduce(.f = c)