
Complex conditional mutating: how can I create a new variable using only past observations for a given participant?


I have a data frame (fbwb) containing between 1 and 6 assessments of bullying per participant, each using between 1 and 3 measures. The data frame looks like this:

fbwb <- read.table(text="id year bully1 bully2 bully3 cbully bully_ever 
100 1 NA 1 NA 1 1
100 2 1 1 NA 1 1
100 3 NA 0 NA 0 1
101 1 NA NA 1 1 1
102 1 NA 1 NA 1 1
102 2 NA NA NA NA 1
102 3 NA 1 1 1 1
102 4 0 0 0 0 1
103 1 NA 1 NA 1 1
103 2 NA 0 0 0 1", header=TRUE)

Here bully1, bully2, and bully3 are binary variables, each equal to 1 if bullying was reported on the respective measure. cbully is binary and equals 1 if any of the 3 bullying variables equals 1 in a given year. bully_ever is binary and equals 1 if bullying was reported on any measure in any year for a given participant.

I want to create a new binary variable in my df called bully_past. bully_past represents the case when cbully = 1 in ANY PAST YEAR. This is subtly different from bully_ever. For example, if a participant has been assessed 4 times:

  • bully_past should use info from years 3, 2, and 1 AT YEAR 4.
  • bully_past should use info from years 2 and 1 AT YEAR 3.
  • bully_past should use info from year 1 AT YEAR 2.
  • bully_past should be NA at year 1.

I have tried quite a few things, but the most recent rendition is the following:

fbwb <- fbwb %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(bully_past = case_when(cbully == 1 & year == (year - 1) |
                                         cbully == 1 & year == (year - 2) |
                                         cbully == 1 & year == (year - 3) |
                                         cbully == 1 & year == (year - 4) |
                                         cbully == 1 & year == (year - 5) ~ 1,
                                       (is.na(cbully) & year == (year - 1) &
                                         is.na(cbully) & year == (year - 2) &
                                         is.na(cbully) & year == (year - 3) &
                                         is.na(cbully) & year == (year - 4) &
                                         is.na(cbully) & year == (year - 5)) ~ NA_real_,
                                       TRUE ~ 0)) %>%
  dplyr::ungroup()

This does not work because the conditions that are meant to select earlier years are wrong: year == (year - 1) compares each row's year with itself minus one, so it is never true, and the new column carries no information about past years. I have made other attempts, but I have not managed to take into account observations from ALL PREVIOUS YEARS.
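The failure of the year comparison can be checked in isolation. The following is a minimal base R sketch; the `year` vector is made up for illustration and is not taken from fbwb. It shows that the comparison is evaluated row by row and never matches, and that reaching the previous row needs an explicit shift.

```r
# Hypothetical years for a single participant (illustration only).
year <- 1:4

# Each element is compared with itself minus one, so nothing can match.
never_true <- year == (year - 1)
print(never_true)

# Looking at the previous row requires shifting the vector explicitly.
prev_year <- c(NA, head(year, -1))
print(prev_year)
```

Because every condition built on `year == (year - k)` is FALSE, none of the case_when() branches that depend on it can ever fire.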

It can be done in Stata using this code:

gen bullyingever = bullying
sort iid time
replace bullyingever = 1 if bullying[_n - 1]==1 & iid[_n - 1]==iid
replace bullyingever = 1 if bullying[_n - 2]==1 & iid[_n - 2]==iid
replace bullyingever = 1 if bullying[_n - 3]==1 & iid[_n - 3]==iid
replace bullyingever = 1 if bullying[_n - 4]==1 & iid[_n - 4]==iid
replace bullyingever = 1 if bullying[_n - 5]==1 & iid[_n - 5]==iid

I appreciate any input on how to accomplish this in R, preferably using dplyr.


Solution

  • Here we can write a helper function that looks at previous events by combining cumsum() and dplyr's lag(). cumsum() keeps a running count of events, which tells you whether an event has occurred up to and including the current row; lag() then shifts that result down one row so that each row sees only the rows before it. So we have

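    had_previous_event <- function(x) {
      # TRUE from the first row where x == 1 onward (NA counts as no event),
      # then shifted down one row so each row only sees earlier rows.
      dplyr::lag(cumsum(!is.na(x) & x == 1) > 0)
    }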
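
    To make the two steps concrete, here is a small sketch that unrolls the helper on a single vector. The values in `x` are hypothetical cbully values for one participant, chosen only for illustration, and the intermediate variable names are not part of the answer.

    ```r
    library(dplyr)

    # Hypothetical cbully values for one participant across four years.
    x <- c(0, NA, 1, 0)

    # Step 1: flag rows where an event was reported; NA is treated as no event.
    is_event <- !is.na(x) & x == 1

    # Step 2: TRUE once at least one event has been seen, current row included.
    seen_so_far <- cumsum(is_event) > 0

    # Step 3: shift by one row so that only earlier rows count.
    seen_before <- dplyr::lag(seen_so_far)
    print(seen_before)
    ```

    The first element is NA because there is no earlier row, which matches the requirement that bully_past be NA in a participant's first year.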
    

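    You can then use that in your dplyr chain. Sorting by id and year first matters, because the helper relies on row order within each participant:

    fbwb %>%
      arrange(id, year) %>%
      group_by(id) %>%
      mutate(bully_past = had_previous_event(cbully)) %>%
      ungroup()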
    

    This returns TRUE/FALSE (with NA for each participant's first year). If you want 0/1 instead, change the mutate step to

      mutate(bully_past = as.numeric(had_previous_event(cbully)))
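
    Putting the pieces together, the following is a self-contained sketch that runs the approach on the sample data from the question. To keep it short, only the id, year, and cbully columns are read in; the name `result` is an assumption made for this example.

    ```r
    library(dplyr)

    # Sample data from the question, reduced to the columns the helper needs.
    fbwb <- read.table(text = "id year cbully
    100 1 1
    100 2 1
    100 3 0
    101 1 1
    102 1 1
    102 2 NA
    102 3 1
    102 4 0
    103 1 1
    103 2 0", header = TRUE)

    # TRUE if an event (x == 1) occurred in any earlier row; NA in the first row.
    had_previous_event <- function(x) {
      dplyr::lag(cumsum(!is.na(x) & x == 1) > 0)
    }

    result <- fbwb %>%
      arrange(id, year) %>%   # the helper depends on row order within id
      group_by(id) %>%
      mutate(bully_past = as.numeric(had_previous_event(cbully))) %>%
      ungroup()

    print(result)
    ```

    In this sample every participant reports bullying in year 1, so bully_past is NA in each participant's first year and 1 in every later year, including year 2 for participant 102, where cbully itself is NA.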