Tags: r, dplyr, count, regression, sequential

Cumulatively Counting Gaps in Sequential Numbers Results in Different Answers When New Data Is Added


I asked a question a few days ago that you all helped me solve, and I am forever grateful! However, a new issue has presented itself, and I am in need of your help once again!

Here's a link to the original problem: (R) Cumulatively Count Gaps in Sequential Numbers

I was trying to cumulatively count gaps in sequential numbers for each UniqueID. This was my dataset:

UniqueID  Month  
ABC123    1       
ABC123    2      
ABC123    3      
ABC123    4      
ABC123    6      
ABC123    7      
DEF456    3      
DEF456    4      
DEF456    10     
DEF456    11     
DEF456    12     
DEF456    14     
GHI789    2      
GHI789    3  
JKL012    12     
JKL012    13     
JKL012    14    

Using your help, I tweaked the code provided from the link above as follows:

library(dplyr)

# Flag months that skip over at least one missing month, then count the skips
# cumulatively within each UniqueID
data2 <- data %>%
  group_by(UniqueID) %>%
  mutate(Skip = if_else(Month - lag(Month, default = first(Month) - 1) - 1 > 0, 1, 0),
         CountSkip = cumsum(Skip))

# Flag the last observed month for each UniqueID
data2 <- data2 %>%
  group_by(UniqueID) %>%
  mutate(LastValue = if_else(Month == last(Month), 1, 0))

# Add 1 to the tally when a UniqueID's last observed month is not the final
# month in the data (hardcoded here as 14)
data2 <- as.data.frame(data2)
data2$FinalTally <- ifelse(data2$LastValue == 1 & data2$Month != 14, 1, 0)
data2$SeqCount <- data2$FinalTally + data2$CountSkip

This was the resulting dataset:

UniqueID  Month  Skip CountSkip LastValue  FinalTally   SeqCount
ABC123    1      0    0         0          0            0
ABC123    2      0    0         0          0            0
ABC123    3      0    0         0          0            0 
ABC123    4      0    0         0          0            0
ABC123    6      1    1         0          0            1
ABC123    7      1    2         1          1            2
DEF456    3      0    0         0          0            0
DEF456    4      0    0         0          0            0
DEF456    10     1    1         0          0            1
DEF456    11     1    1         0          0            1
DEF456    12     1    1         0          0            1  
DEF456    14     2    2         1          0            2
GHI789    2      0    0         0          0            0
GHI789    3      0    1         1          1            1
JKL012    12     0    0         0          0            0
JKL012    13     0    0         0          0            0 
JKL012    14     0    0         1          0            0

This is what I wanted...or so I thought.

When I added new data for the next month (15), I edited the second-to-last line of my code to account for 15 being the new final month. However, I noticed that the sum of SeqCount by Month differed from the sum for that same month before the new data was added. I filtered down to one month and found an example of a UniqueID where the SeqCount sum differed.
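
For reference, the edit in question is just the hardcoded final month in the FinalTally line, with 15 substituted for 14:

# Hardcoded final month updated from 14 to 15 after the new data was added
data2$FinalTally <- ifelse(data2$LastValue == 1 & data2$Month != 15, 1, 0)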

Here is an example before the new data was included:

UniqueID  Month  Skip CountSkip LastValue  FinalTally   SeqCount
ZZZ999    2      0    0         0          0            0
ZZZ999    3      0    0         0          0            0
ZZZ999    4      0    0         0          0            0 
ZZZ999    5      0    0         0          0            0
ZZZ999    6      0    0         1          1            1

Here is the example when the new data was included:

UniqueID  Month  Skip CountSkip LastValue  FinalTally   SeqCount
ZZZ999    2      0    0         0          0            0
ZZZ999    3      0    0         0          0            0
ZZZ999    4      0    0         0          0            0 
ZZZ999    5      0    0         0          0            0
ZZZ999    6      0    0         0          0            0
ZZZ999    15     1    1         1          0            1

This is the problem: Month 6 loses a value of SeqCount when new data is added in.

My ultimate goal is to run a regression model for each month, with SeqCount as the response and some other columns as predictors (I didn't include them for ease of reading). Whenever I add new data, the response changes and my estimates will not be consistent.
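
As a rough illustration of those per-month models (Predictor1 is just a placeholder name here, since the real predictor columns are not shown in this post):

# Sketch only: fit one lm() per Month, with SeqCount as the response;
# Predictor1 is a hypothetical stand-in for the unshown predictor columns
models <- lapply(split(data2, data2$Month),
                 function(d) lm(SeqCount ~ Predictor1, data = d))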

Is there a way I can structure my code differently so when I add new data, the logic of the code does not change the information from previous values of SeqCount?

Any help would be appreciated!

Thank you!


Solution

  • The following seems to reproduce what you want without hardcoding any values into the logic.

    Note: as others have commented, there appears to be a discrepancy between the results in the question and the code provided. For example, in the question post, UniqueID GHI789 has a CountSkip value of 1 for Month 3, even though the question's own code returns 0. The code in this answer also returns 0.

    library(dplyr)

    # Original example data
    data <- cbind.data.frame(UniqueID = c('ABC123','ABC123','ABC123','ABC123','ABC123','ABC123','DEF456','DEF456','DEF456','DEF456','DEF456','DEF456','GHI789','GHI789','JKL012','JKL012','JKL012'),
                             Month = c(1,2,3,4,6,7,3,4,10,11,12,14,2,3,12,13,14))
    
    # Full UniqueID x Month grid so that missing months appear as NA after the join
    cartesian <- expand.grid(UniqueID = unique(as.character(data$UniqueID)),
                             Month = seq(from = min(data$Month), to = max(data$Month), by = 1))
    
    BA <- cartesian %>% 
      left_join(data %>% mutate(Month_orig = Month), by = c("UniqueID", "Month")) %>% 
      arrange(UniqueID, Month) %>% 
      group_by(UniqueID) %>% 
      # Flag an observed month whose immediately preceding month is missing,
      # excluding the first observed month for the UniqueID
      mutate(Skip = ifelse(Month == Month_orig & 
                             is.na(lag(Month_orig, 1)) & 
                             Month != min(Month[!is.na(Month_orig)]), 
                           1, 0)) %>%
      mutate(Skip = ifelse(Skip == 1 & is.na(lag(Month_orig, 2)), 0, Skip)) %>% # constrain to only one skipped period?
      filter(!is.na(Month_orig)) %>% 
      mutate(CountSkip = cumsum(Skip)) %>% 
      mutate(LastValue = ifelse(Month == max(Month), 1, 0)) %>%
      mutate(FinalTally = ifelse(LastValue == 1 & Month != max(Month), 1, 0)) %>% 
      mutate(SeqCount = FinalTally + CountSkip) %>% 
      select(-Month_orig)
    
    BA
    # A tibble: 17 x 7
    # Groups:   UniqueID [4]
       UniqueID Month  Skip CountSkip LastValue FinalTally SeqCount
       <fct>    <dbl> <dbl>     <dbl>     <dbl>      <dbl>    <dbl>
     1 ABC123       1     0         0         0          0        0
     2 ABC123       2     0         0         0          0        0
     3 ABC123       3     0         0         0          0        0
     4 ABC123       4     0         0         0          0        0
     5 ABC123       6     1         1         0          0        1
     6 ABC123       7     0         1         1          0        1
     7 DEF456       3     0         0         0          0        0
     8 DEF456       4     0         0         0          0        0
     9 DEF456      10     0         0         0          0        0
    10 DEF456      11     0         0         0          0        0
    11 DEF456      12     0         0         0          0        0
    12 DEF456      14     1         1         1          0        1
    13 GHI789       2     0         0         0          0        0
    14 GHI789       3     0         0         1          0        0
    15 JKL012      12     0         0         0          0        0
    16 JKL012      13     0         0         0          0        0
    17 JKL012      14     0         0         1          0        0
    

    Reading the question and comments more closely, I agree that a reference to the previous version of the table would be required to implement a solution. You might try rbind(old_data, new_rows), where new_rows is the result of processing all of the data and keeping only the (UniqueID, Month) combinations not contained in old_data. I don't fully follow the logic or intent here, so perhaps that is not a good idea.
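
    A minimal sketch of that approach, assuming old_data is a saved copy of the previously processed table (the name is just a placeholder) and BA is the result of re-running the pipeline above on the full, updated data:

    # Keep only the (UniqueID, Month) combinations that old_data does not already
    # contain, then append them so previously computed rows are never recomputed
    new_rows <- BA %>% anti_join(old_data, by = c("UniqueID", "Month"))
    updated  <- rbind(old_data, new_rows)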