
Filter grouped data by partial string specified in another column


I want to filter grouped data using either (a) a partial string specified in another column or, if that is easier, (b) a partial string that I specify in the code.

I have the following data frame:

  df <- structure(list(
  sen = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), 
  trial = c("standard", "standard", "standard", "standard", "standard", "silence", "silence", "silence", "silence", "silence", "deviant", "deviant", "deviant", "deviant", "deviant","standard", "standard", "standard", "standard", "standard", "silence", "silence", "silence", "silence", "silence", "deviant", "deviant", "deviant", "deviant", "deviant"),
  ppt = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3), 
  ia_label = c("The_TW1", "cow_TW2", "jumped_TW3", "the_TW4", "gate_TW5","The_TW1", "cow_TW2", "jumped_TW3", "the_TW4", "gate_TW5", "The_TW1", "cow_TW2", "jumped_TW3", "the_TW4", "gate_TW5", "The_TW1", "cow_TW2", "jumped_TW3", "the_TW4", "gate_TW5","The_TW1", "cow_TW2", "jumped_TW3", "the_TW4", "gate_TW5", "The_TW1", "cow_TW2", "jumped_TW3", "the_TW4", "gate_TW5"), 
  target_pos = c("0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "TW3", "TW3", "TW3", "TW3", "TW3", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "TW4", "TW4", "TW4", "TW4", "TW4")),
  .Names = c("sen","trial","ppt","ia_label","target_pos"),
  row.names = c(NA, -30L),
  class = "data.frame")
sen trial ppt ia_label target_pos
1 standard 1 The_TW1 0
1 standard 1 cow_TW2 0
1 standard 1 jumped_TW3 0
1 standard 1 the_TW4 0
1 standard 1 gate_TW5 0
1 silence 2 The_TW1 0
1 silence 2 cow_TW2 0
1 silence 2 jumped_TW3 0
1 silence 2 the_TW4 0
1 silence 2 gate_TW5 0
1 deviant 3 The_TW1 TW3
1 deviant 3 cow_TW2 TW3
1 deviant 3 jumped_TW3 TW3
1 deviant 3 the_TW4 TW3
1 deviant 3 gate_TW5 TW3
2 standard 1 The_TW1 0
2 standard 1 cow_TW2 0
2 standard 1 jumped_TW3 0
2 standard 1 the_TW4 0
2 standard 1 gate_TW5 0
2 silence 2 The_TW1 0
2 silence 2 cow_TW2 0
2 silence 2 jumped_TW3 0
2 silence 2 the_TW4 0
2 silence 2 gate_TW5 0
2 deviant 3 The_TW1 TW4
2 deviant 3 cow_TW2 TW4
2 deviant 3 jumped_TW3 TW4
2 deviant 3 the_TW4 TW4
2 deviant 3 gate_TW5 TW4

I want to keep only the rows whose 'ia_label' contains the string given in 'target_pos' for the deviant condition (either TW3 or TW4), with the string chosen separately for each 'sen' group. So for all rows with sen = 1 I want to keep only those whose ia_label contains _TW3, and for sen = 2 only those whose ia_label contains _TW4:

sen trial ppt ia_label target_pos
1 standard 1 jumped_TW3 0
1 silence 2 jumped_TW3 0
1 deviant 3 jumped_TW3 TW3
2 standard 1 the_TW4 0
2 silence 2 the_TW4 0
2 deviant 3 the_TW4 TW4

I only need to filter by a small number of different strings, so if it isn't possible to filter each group by the partial string in the 'target_pos' column, I don't mind doing this manually by specifying the partial string to match against 'ia_label' for each group.

I tried the following code with group_by, filter and grepl, but I get the error below:

library(dplyr)
Df2 <- df %>% 
  group_by(sen) %>%
  filter(df, grepl("TW3",ia_label))   

Output:

Error: Problem with filter() input ..1.
x Input ..1 must be of size 15 or 1, not size 30.
i Input ..1 is df.
i The error occurred in group 1: sen = 1.
Run rlang::last_error() to see where the error occurred.


Solution

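  • The error occurs because df is passed to filter() twice: once through the pipe and once as an explicit argument. The explicit df is then treated as a filter condition, and its 30 rows do not match the group size of 15. Changing filter(df, grepl(... to filter(grepl(... avoids the error.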

    df %>%
      group_by(sen) %>% 
      filter(grepl("TW3", ia_label))
    

    To filter each group by its own target_pos value instead (the first value in the group where trial == "deviant"), use that value as the pattern:

    df %>% 
      group_by(sen) %>%
      filter(grepl(
        pattern = first(target_pos[trial == "deviant"]),
        x = ia_label
      ))
    # # A tibble: 6 × 5
    # # Groups:   sen [2]
    #     sen trial      ppt ia_label   target_pos
    #   <dbl> <chr>    <dbl> <chr>      <chr>     
    # 1     1 standard     1 jumped_TW3 0         
    # 2     1 silence      2 jumped_TW3 0         
    # 3     1 deviant      3 jumped_TW3 TW3       
    # 4     2 standard     1 the_TW4    0         
    # 5     2 silence      2 the_TW4    0         
    # 6     2 deviant      3 the_TW4    TW4
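
Two variations on the above, offered as a sketch rather than as part of the original answer. grepl() treats the pattern as a regular expression and matches it anywhere in the string, so a target such as TW1 would also match a label ending in _TW10; base R's endsWith() compares the literal suffix instead. The second variant covers option (b) from the question with a hand-written lookup. The names targets, by_column and by_lookup are assumptions introduced for illustration, and the data frame is rebuilt compactly with rep() only so that the block runs on its own.

```r
library(dplyr)

# Compact rebuild of the question's data frame (same 30 rows).
labels <- c("The_TW1", "cow_TW2", "jumped_TW3", "the_TW4", "gate_TW5")
df <- data.frame(
  sen        = rep(c(1, 2), each = 15),
  trial      = rep(rep(c("standard", "silence", "deviant"), each = 5), times = 2),
  ppt        = rep(rep(c(1, 2, 3), each = 5), times = 2),
  ia_label   = rep(labels, times = 6),
  target_pos = c(rep("0", 10), rep("TW3", 5), rep("0", 10), rep("TW4", 5)),
  stringsAsFactors = FALSE
)

# Variant 1: take the target from the column, but match it as a literal
# "_<target>" suffix instead of as a regular expression.
by_column <- df %>%
  group_by(sen) %>%
  filter(endsWith(ia_label, paste0("_", first(target_pos[trial == "deviant"])))) %>%
  ungroup()

# Variant 2: option (b), a manual lookup keyed by the sen value.
# `targets` is a hypothetical hand-written table, not part of the original data.
targets <- c("1" = "TW3", "2" = "TW4")
by_lookup <- df %>%
  filter(endsWith(ia_label, paste0("_", targets[as.character(sen)])))
```

With the sample data both variants keep the same six rows as the grepl() version. A group with no deviant rows (variant 1) or a sen value missing from targets (variant 2) produces the suffix "_NA", so no rows from that group are kept.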