Search code examples
rdata-manipulation

What is the best way to check for consecutive missing values in a data column in R and exclude them based on a related column value?


I am trying to write code in R for a dataset to check if DAYS column have consecutive numbers and print out the missing DAYS number, in such a way that, if the count of missing consecutive numbers between two rows of the DAYS column equals to that count+1 in the corresponding last row of the PERIOD column, exclude it from the output. For example, consider the two rows in DAYS column 163 and 165, where the count of missing number is 1. But in this case, the last row (where DAYS is 165) has PERIOD value of 2, that is (count+1). So, exclude this missing value (164) from the output. However if you look at DAYS 170 and 172,y you can see 172 has PERIOD value of 1 (not 2 or count+1). So, show this output (171).

Here is the first 28 rows of the dataset.

DAYS PERIOD
146 1
147 1
148 1
149 1
150 1
151 1
152 1
153 1
154 1
155 1
156 1
157 1
158 1
159 1
160 1
161 1
162 1
163 1
165 2
166 1
167 1
168 1
169 1
170 1
172 1
173 1
174 1
175 1



I tried First, created a sequence of expected DAYS values expected_days <- seq(min(hs$DAYS), max(hs$DAYS))

Then, find the missing DAYS values missing_days <- setdiff(expected_days, hs$DAYS)

How to do the next bit?


Solution

  • I've managed to do this using tidyverse tools:

    Set up example data

    I've tweaked your data slightly to show that the solution can handle longer runs of missing days.

    library(vroom)
    library(dplyr)
    library(tidyr)
    
    test <-
      vroom(
        I(
    "days period
    161 1
    162 1
    163 1
    166 3
    167 1
    168 1
    169 1
    170 1
    172 1
    "),
    col_types = c("ii"))
    
    

    Add 'empty' days explicitly to data frame

    all_days <- min(test[["days"]]):max(test[["days"]])
    
    frame <- tibble(days = all_days)
    
    test <-
      right_join(test, frame, by = "days") |> 
      arrange(days)
    
    test
    #> # A tibble: 12 × 2
    #>     days period
    #>    <int>  <int>
    #>  1   161      1
    #>  2   162      1
    #>  3   163      1
    #>  4   164     NA
    #>  5   165     NA
    #>  6   166      3
    #>  7   167      1
    #>  8   168      1
    #>  9   169      1
    #> 10   170      1
    #> 11   171     NA
    #> 12   172      1
    

    Find the number of consecutive missing days

    test <- 
      mutate(test,
             no_na = xor(is.na(period), is.na(lag(period))),
              missingness_group = cumsum(no_na)) |> 
      select(-no_na)
    
    test <- 
      group_by(test, missingness_group) |> 
      mutate(missing_days = 
               case_when(
                 all(is.na(period)) ~ n(),
                 TRUE               ~ 0)) |> 
      ungroup() |> 
      select(-missingness_group)
    
    test
    #> # A tibble: 12 × 3
    #>     days period missing_days
    #>    <int>  <int>        <dbl>
    #>  1   161      1            0
    #>  2   162      1            0
    #>  3   163      1            0
    #>  4   164     NA            2
    #>  5   165     NA            2
    #>  6   166      3            0
    #>  7   167      1            0
    #>  8   168      1            0
    #>  9   169      1            0
    #> 10   170      1            0
    #> 11   171     NA            1
    #> 12   172      1            0
    

    Remove rows where days are all accounted for

    test <- mutate(test, extra_days = period - 1)
    
    test <- fill(test, extra_days, .direction = "up")
    
    test <-
      filter(test, !is.na(period) | missing_days > extra_days) |> 
      select(days, period)
    
    test
    #> # A tibble: 10 × 2
    #>     days period
    #>    <int>  <int>
    #>  1   161      1
    #>  2   162      1
    #>  3   163      1
    #>  4   166      3
    #>  5   167      1
    #>  6   168      1
    #>  7   169      1
    #>  8   170      1
    #>  9   171     NA
    #> 10   172      1
    

    Created on 2023-06-01 with reprex v2.0.2