r dataframe dplyr

Using apply and filter on an R data frame


I am new to R and am looking for a clean way to first apply a function to two columns of a data frame and then filter the rows by certain criteria.

The initial data frame looks like this:

Sample1 Sample2 Sisters Value
100_99 200_98 Yes 20
101_99 200_98 Yes 20
102_99 200_98 Yes 20
103_99 201_98 Yes 20
104_99 201_98 Yes 20
200_99 100_98 Yes 20
100_99 100_98 Yes 20

and I want it to look like this:

Sample1 Sample2 Sisters Value
100_99 200_98 Yes 20
200_99 100_98 Yes 20
100_99 100_98 Yes 20

I additionally have a vector:

toCheck <- c(100, 200)

What I want to do:

1st) Take only the first number of each string (everything before the _) in the columns "Sample1" and "Sample2"

2nd) Check whether those numbers are in the vector "toCheck"

3rd) Keep all rows where both columns have a number that is in "toCheck"

I have tried many things, and I do not know whether piping is the right approach. (I have written a function that extracts the first number from the string.)

qq <-  df %>% 
  df $Sample1 <-lapply(df $Sample1, functionToJustTakeTheFirstNumber)
  df $Sample2 <-lapply(df $Sample2, functionToJustTakeTheFirstNumber)
  filter(Sample1 %in% toCheck && Sample2 %in% toCheck ) 

I always get error messages like

Error in match.fun(FUN) :
  'df$Sample2' is not a function, character or symbol


Solution

  • A pure tidyverse approach would be to use dplyr::filter, dplyr::if_all and stringr::str_split_i like so:

    library(dplyr, warn.conflicts = FALSE)
    library(stringr)
    
    toCheck <- c(100, 200)
    
    df |>
      filter(
        if_all(
          c(Sample1, Sample2),
          ~ stringr::str_split_i(.x, "_", 1) %in% toCheck
        )
      )
    #>   Sample1 Sample2 Sisters Value
    #> 1  100_99  200_98     Yes    20
    #> 2  200_99  100_98     Yes    20
    #> 3  100_99  100_98     Yes    20
    

    DATA

    df <- data.frame(
      stringsAsFactors = FALSE,
      Sample1 = c(
        "100_99", "101_99", "102_99",
        "103_99", "104_99", "200_99", "100_99"
      ),
      Sample2 = c(
        "200_98", "200_98", "200_98",
        "201_98", "201_98", "100_98", "100_98"
      ),
      Sisters = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
      Value = c(20L, 20L, 20L, 20L, 20L, 20L, 20L)
    )
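
For comparison, the same three steps can be sketched in base R without any packages. This is only an illustration: the helper name first_number is made up here and stands in for the asker's functionToJustTakeTheFirstNumber, and it assumes every value has the form "number_number". An error like the one in the question typically appears when the pipe passes df as the first argument of lapply, which pushes df$Sample2 into the FUN slot. The filter condition also needs the element-wise & rather than &&, because && is not vectorised.

```r
# Example data, matching the DATA block above.
df <- data.frame(
  Sample1 = c("100_99", "101_99", "102_99", "103_99", "104_99", "200_99", "100_99"),
  Sample2 = c("200_98", "200_98", "200_98", "201_98", "201_98", "100_98", "100_98"),
  Sisters = "Yes",
  Value = 20L,
  stringsAsFactors = FALSE
)
toCheck <- c(100, 200)

# Hypothetical helper (not from the original post):
# drop everything from the first underscore onwards and convert to numeric.
# sub() is vectorised, so no lapply() is needed.
first_number <- function(x) as.numeric(sub("_.*$", "", x))

# Step 1 and 2: extract the leading number of each column and test membership.
# Note the element-wise `&`; `&&` only handles single TRUE/FALSE values.
keep <- first_number(df$Sample1) %in% toCheck &
  first_number(df$Sample2) %in% toCheck

# Step 3: keep only the rows where both columns matched.
result <- df[keep, ]
print(result)
```

Because the helper is only used to build the logical vector keep, the original strings in Sample1 and Sample2 stay untouched in the result, which matches the desired output in the question.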