Search code examples
rlagrankinglead

Choosing a single row based on multiple criteria


Can anyone think how to how to do this in R?

Simple data:

    seq<-c("A","A","A","B","B","B","B")
    rank<-c(1,2,3,1,2,3,4)
    match<-c("y","n","y","n","n","y","y")
    df<- as.data.frame(cbind(seq,rank,match))

      seq rank match
    1   A    1     y
    2   A    2     n
    3   A    3     y
    4   B    1     n
    5   B    2     n
    6   B    3     y
    7   B    4     y

I want to create a ‘choose’ column whereby, for every unique seq, the first instance of y under match is given a T, and all the rest are given a F.

The desired output would be:

      seq rank match choose
    1   A    1     y      T
    2   A    2     n      F
    3   A    3     y      F
    4   B    1     n      F
    5   B    2     n      F
    6   B    3     y      T
    7   B    4     y      F

It’s easy enough to return the correct value for rank 1 and 2 by combining an ifelse statement with lag, however I'm stumped once rank >2.

The real dataset contains+100k rows, and rank could go up to the hundreds, so I don’t want to just extend an ifelse statement to check values above using lag.

My ultimate aim is to create a new dataset from all of the T's under "choose", so if someone know how to directly pull them out without creating a new column that would be even better!

I’m guessing an ifelse statement is a daft way of doing this, but I’m stuck :/

Any help would be much appreciated : )


Solution

  • One dplyr possibility could be:

    df %>%
     group_by(seq) %>%
     mutate(choose = +(match == "y") * (cumsum(match == "y") == 1))
    
      seq   rank  match choose
      <fct> <fct> <fct>  <int>
    1 A     1     y          1
    2 A     2     n          0
    3 A     3     y          0
    4 B     1     n          0
    5 B     2     n          0
    6 B     3     y          1
    7 B     4     y          0
    

    And if you want to have TRUE/FALSE values:

    df %>%
     group_by(seq) %>%
     mutate(choose = as.logical(+(match == "y") * (cumsum(match == "y") == 1)))
    
      seq   rank  match choose
      <fct> <fct> <fct> <lgl> 
    1 A     1     y     TRUE  
    2 A     2     n     FALSE 
    3 A     3     y     FALSE 
    4 B     1     n     FALSE 
    5 B     2     n     FALSE 
    6 B     3     y     TRUE  
    7 B     4     y     FALSE
    

    And the same with base R:

    with(df, ave(match, seq, FUN = function(x) +(x == "y") * (cumsum(x == "y") == 1)))
    

    Or:

    with(df, ave(match, seq, FUN = function(x) as.logical(+(x == "y") * (cumsum(x == "y") == 1))))