Search code examples
rpanel-data

Subset panel data with more than one match for a criteria over multiple columns


I have data over individuals that looks like this:

df=data.frame(Col1=c("A","A","B","B"),Col2=c("A50","C50","B60","A70"),Col3=c("A40","A50","A50","A70"))
criteria="A50"

  Col1 Col2 Col3
    A  A50  A40
    A  C50  A50
    B  A50  A50
    B  A70  A70

I want to pick each individual in Col1 that has the criteria fulfilled (A50) in any of the columns in at two different observations. That is, individual A will be selected since he has A50 in two different observations. Individual B will however not be selected since he only has A50 at in one observation, even if he became A50 two times during that observation.

The problem is an extension to this one: Subset multiple columns in R with multiple matches


Solution

  • Try using this with dplyr 1.0.0

    library(dplyr)
    
    cols <- c('Col2', 'Col3') 
    
    df %>%
      group_by(Col1) %>%
      filter(sum(colSums(cur_data()[cols] == criteria) >= 1) > 1)
    
    #  Col1  Col2  Col3 
    #  <chr> <chr> <chr>
    #1 A     A50   A40  
    #2 A     C50   A50  
    

    cur_data()[cols] selects only cols column, colSums counts number of matches in each column and sum ensures that the match is in different columns.