Search code examples
rconditional-statementstime-seriesmutate

mutate column based on difference in col A, with rows identified by relative values in col B


I have the dataframe below, showing sampling events and when each individual was observed, I want a new column showing the number of sampling events between the current observation and the previous most recent observation of an individual.

so what I have is this:

tibble(ind=c(rep("A",10),rep("B",10)),
   event=c(c(1:10),c(1:10)),
   obs=c(NA,NA,1,2,NA,NA,NA,4,NA,5,1,NA,NA,NA,2,NA,3,4,NA,5))

    # A tibble: 20 × 3
   ind   event   obs
   <chr> <int> <dbl>
 1 A         1    NA
 2 A         2    NA
 3 A         3     1
 4 A         4     2
 5 A         5    NA
 6 A         6    NA
 7 A         7    NA
 8 A         8     4
 9 A         9    NA
10 A        10     5
11 B         1     1
12 B         2    NA
13 B         3    NA
14 B         4    NA
15 B         5     2
16 B         6    NA
17 B         7     3
18 B         8     4
19 B         9    NA
20 B        10     5

And what I want is this, where missed.events = (event at obs[i])-(event at obs[i-1])-1:

# A tibble: 20 × 4
   ind   event   obs missed.events
   <chr> <int> <dbl>         <dbl>
 1 A         1    NA            NA
 2 A         2    NA            NA
 3 A         3     1            NA
 4 A         4     2             0
 5 A         5    NA            NA
 6 A         6    NA            NA
 7 A         7    NA            NA
 8 A         8     4             3
 9 A         9    NA            NA
10 A        10     5             1
11 B         1     1            NA
12 B         2    NA            NA
13 B         3    NA            NA
14 B         4    NA            NA
15 B         5     2             3
16 B         6    NA            NA
17 B         7     3             1
18 B         8     4             0
19 B         9    NA            NA
20 B        10     5             1

I'm sure I am missing something obvious here, and I've tried various sorts of case_when() and if_else() conditions within a group_by(ind) %>% mutate() argument, but can't quite figure this out.

thanks in advance


Solution

  • Obtain the subset with missed events and join back to the original dataframe

    df %>%
      filter(!is.na(obs)) %>%
      mutate(missed.events=event-lag(event)-1, .by=ind) %>%
      right_join(df)
    
    Joining, by = c("ind", "event", "obs")
    # A tibble: 20 x 4
       ind   event   obs missed.events
       <chr> <int> <dbl>         <dbl>
     1 A         1    NA            NA
     2 A         2    NA            NA
     3 A         3     1            NA
     4 A         4     2             0
     5 A         5    NA            NA
     6 A         6    NA            NA
     7 A         7    NA            NA
     8 A         8     4             3
     9 A         9    NA            NA
    10 A        10     5             1
    11 B         1     1            NA
    12 B         2    NA            NA
    13 B         3    NA            NA
    14 B         4    NA            NA
    15 B         5     2             3
    16 B         6    NA            NA
    17 B         7     3             1
    18 B         8     4             0
    19 B         9    NA            NA
    20 B        10     5             1