I asked a question a few days ago which you guys helped me solve and I am forever grateful! However, a new issue has presented itself and I am in need of your help once again!
Here's a link to the original problem: (R) Cumulatively Count Gaps in Sequential Numbers
I was trying to cumulatively counts gaps in sequential numbers for each selection of UniqueIDs. This was my dataset:
UniqueID Month
ABC123 1
ABC123 2
ABC123 3
ABC123 4
ABC123 6
ABC123 7
DEF456 3
DEF456 4
DEF456 10
DEF456 11
DEF456 12
DEF456 14
GHI789 2
GHI789 3
JKL012 12
JKL012 13
JKL012 14
Using your help, I tweaked the code provided from the link above as follows:
data2=data %>%
group_by(UniqueID) %>%
mutate(Skip = if_else(Month - lag(Month, default = first(Month) - 1) - 1 > 0, 1, 0),
CountSkip = cumsum(Skip))
data2 = data2%>%
group_by(UniqueID) %>%
mutate(LastValue = if_else(Month == last(Month), 1, 0))
data2=as.data.frame(data2)
data2$FinalTally=ifelse(data2$LastValue==1 & data2$Month!=14,1,0)
data2$SeqCount=data2$FinalTally+data2$CountSkip
This was the resulting dataset:
UniqueID Month Skip CountSkip LastValue FinalTally SeqCount
ABC123 1 0 0 0 0 0
ABC123 2 0 0 0 0 0
ABC123 3 0 0 0 0 0
ABC123 4 0 0 0 0 0
ABC123 6 1 1 0 0 1
ABC123 7 1 2 1 1 2
DEF456 3 0 0 0 0 0
DEF456 4 0 0 0 0 0
DEF456 10 1 1 0 0 1
DEF456 11 1 1 0 0 1
DEF456 12 1 1 0 0 1
DEF456 14 2 2 1 0 2
GHI789 2 0 0 0 0 0
GHI789 3 0 1 1 1 1
JKL012 12 0 0 0 0 0
JKL012 13 0 0 0 0 0
JKL012 14 0 0 1 0 0
This is what I wanted...or so I thought.
When adding in new data for the next month (15), I edited the second to last line of my code to account for 15 being the new final month. However, I noticed the sum of SeqCount by Month differed from the sum of that same month before the new data was added. I filtered down to one month and found an example of one UniqueID where the SeqCount sum had differed.
Here is an example before the new data was included:
UniqueID Month Skip CountSkip LastValue FinalTally SeqCount
ZZZ999 2 0 0 0 0 0
ZZZ999 3 0 0 0 0 0
ZZZ999 4 0 0 0 0 0
ZZZ999 5 0 0 0 0 0
ZZZ999 6 0 0 1 1 1
Here is the example when the new data was included:
UniqueID Month Skip CountSkip LastValue FinalTally SeqCount
ZZZ999 2 0 0 0 0 0
ZZZ999 3 0 0 0 0 0
ZZZ999 4 0 0 0 0 0
ZZZ999 5 0 0 0 0 0
ZZZ999 6 0 0 0 0 0
ZZZ999 15 1 1 1 0 1
This is the problem: Month 6 loses a value of SeqCount when new data is added in.
My ultimate goal is to run a regression model for each month with SeqCount as the response with some other columns as predictors (I didn't include them for ease of reading). Whenever I add new data in, the response will change and my estimates will not be consistent.
Is there a way I can structure my code differently so when I add new data, the logic of the code does not change the information from previous values of SeqCount?
Any help would be appreciated!
Thank you!
The following seems to reproduce what you want without hardcoding any values into the logic.
Note - as others have commented, there appears to be a discrepancy between the results in the question and the code provided. For example, in the question post, UniqueID GHI789 for Month 3 has a CountSkip value of 1 though the code returns 0. The code in this answer returns 0.
data <- cbind.data.frame(UniqueID = c('ABC123','ABC123','ABC123','ABC123','ABC123','ABC123','DEF456','DEF456','DEF456','DEF456','DEF456','DEF456','GHI789','GHI789','JKL012','JKL012','JKL012'),
Month = c(1,2,3,4,6,7,3,4,10,11,12,14,2,3,12,13,14))
cartesian <- expand.grid(UniqueID = unique(as.character(data$UniqueID)),
Month = seq(from=min(data$Month), to=max(data$Month), by=1))
BA <- cartesian %>%
left_join(data %>% mutate(Month_orig=Month), by=c("UniqueID","Month")) %>%
arrange(UniqueID, Month) %>%
group_by(UniqueID) %>%
mutate(Skip = ifelse(Month==Month_orig &
is.na(lag(Month_orig,1)) &
Month!=min(Month[!is.na(Month_orig)]),
1, 0)) %>%
mutate(Skip = ifelse(Skip==1 & is.na(lag(Month_orig,2)), 0, Skip)) %>% # contstrain to only one skipped period?
filter(!is.na(Month_orig)) %>%
mutate(CountSkip = cumsum(Skip)) %>%
mutate(LastValue = ifelse(Month==max(Month), 1, 0)) %>%
mutate(FinalTally = ifelse(LastValue==1 & Month != max(Month),1,0)) %>%
mutate(SeqCount = FinalTally + CountSkip) %>%
select(-Month_orig)
BA
# A tibble: 17 x 7
# Groups: UniqueID [4]
UniqueID Month Skip CountSkip LastValue FinalTally SeqCount
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC123 1 0 0 0 0 0
2 ABC123 2 0 0 0 0 0
3 ABC123 3 0 0 0 0 0
4 ABC123 4 0 0 0 0 0
5 ABC123 6 1 1 0 0 1
6 ABC123 7 0 1 1 0 1
7 DEF456 3 0 0 0 0 0
8 DEF456 4 0 0 0 0 0
9 DEF456 10 0 0 0 0 0
10 DEF456 11 0 0 0 0 0
11 DEF456 12 0 0 0 0 0
12 DEF456 14 1 1 1 0 1
13 GHI789 2 0 0 0 0 0
14 GHI789 3 0 0 1 0 0
15 JKL012 12 0 0 0 0 0
16 JKL012 13 0 0 0 0 0
17 JKL012 14 0 0 1 0 0
In reading the question and comments more closely - I agree that a reference to the previous version of the table would be required to implement a solution. You might try rbind(old_data,new_rows) where new_rows is the result of processing all data and keeping only the unique(UniqueID, Month) not contained in the old_data. I don't fully follow the logic or intent here so perhaps that is not a good idea.