Search code examples
rdplyrgroupingcumsum

R: Remove subsequent rows after specific occurences of character by group


I have a large dataset similar to the following one.

ID Date Change wage
201 04.01.2023 B-AL 900
201 01.03.2023 AL-AL 900
201 01.10.2023 AL-AL 900
201 10.11.2023 AL-B 1900
201 28.11.2023 B-AL 900
201 10.12.2023 AL-B 1905
301 03.01.2023 B-AL 1100
301 11.02.2023 AL-B 2500
301 15.04.2023 B-AL 1110
301 27.09.2023 AL-B 2506
401 02.01.2023 B-AL 700
401 28.01.2023 AL-X 0
401 10.02.2023 AL-AL 700
401 02.04.2023 AL-AL 700
401 10.05.2023 AL-B 2200
401 13.08.2023 B-AL 800

I want to calculate the average wage by each status, but with two restirctions: First, I only want to consider statuses of each individual until the status "B-AL" is reached the second time. All rows after the second occurance of "B-AL" by each ID should not be considered in the calculation.

(I think) I solved this by creating two help variables and then filter by those with cumsum()

data <- data %>%
  arrange(date) %>%
  mutate(num = row_number(),
         num2 = ifelse(change == "B-AL" & num == 2, 1, 0),
         .by = c(ID, change))

data <- data %>%
  group_by(ID) %>%
  filter(cumsum(num2) == 0) %>% 
  ungroup()

Now the second restiction: As soon as one ID has status "AL-X" all subseqent rows should not bne considered as well.

If I use the same approach as above I also loose the observation where "AL-X" occurs the first time. However, I want this one to be included and all subsequent ones to be excluded. After both operations the table should look like this

ID Date Change wage
201 04.01.2023 B-AL 900
201 01.03.2023 AL-AL 900
201 01.10.2023 AL-AL 900
201 10.11.2023 AL-B 1900
301 03.01.2023 B-AL 1100
301 11.02.2023 AL-B 2500
401 02.01.2023 B-AL 700
401 28.01.2023 AL-X 0

After that I can calculate the average wages with summarise (first by status and then by ID)

Any help is appreciated. I am also not sure whether the first operation is sound and efficient.


Solution

  • Judging by the desired output you want

    • remove everything when "B-AL" occurred the second time including that line and
    • everything removed after the first occurrence of "AL-X".
    library(dplyr)
    
    df %>% 
      filter(cumsum(Change == c("B-AL")) < 2, 
             lag(cumsum(Change == c("AL-X")), default=0) < 1, .by = ID)
       ID       Date Change wage
    1 201 04.01.2023   B-AL  900
    2 201 01.03.2023  AL-AL  900
    3 201 01.10.2023  AL-AL  900
    4 201 10.11.2023   AL-B 1900
    5 301 03.01.2023   B-AL 1100
    6 301 11.02.2023   AL-B 2500
    7 401 02.01.2023   B-AL  700
    8 401 28.01.2023   AL-X    0