I am new to R and am looking for a clean way to first apply a function to two columns of a data frame and then filter the rows by certain criteria.
The initial data frame looks like this:
Sample1 | Sample2 | Sisters | Value |
---|---|---|---|
100_99 | 200_98 | Yes | 20 |
101_99 | 200_98 | Yes | 20 |
102_99 | 200_98 | Yes | 20 |
103_99 | 201_98 | Yes | 20 |
104_99 | 201_98 | Yes | 20 |
200_99 | 100_98 | Yes | 20 |
100_99 | 100_98 | Yes | 20 |
and I want it to look like this:
Sample1 | Sample2 | Sisters | Value |
---|---|---|---|
100_99 | 200_98 | Yes | 20 |
200_99 | 100_98 | Yes | 20 |
100_99 | 100_98 | Yes | 20 |
I additionally have a vector:
toCheck <- c(100, 200)
What I want to do:
1st) Take just the first number of the string (everything before the _) in each of the columns "Sample1" and "Sample2"
2nd) Check whether those numbers are in "toCheck"
3rd) Keep only the rows where both columns have a number in "toCheck"
I have tried many things and am not sure whether piping is the right approach (I have already written a function that extracts just the first number from the string):
qq <- df %>%
df $Sample1 <-lapply(df $Sample1, functionToJustTakeTheFirstNumber)
df $Sample2 <-lapply(df $Sample2, functionToJustTakeTheFirstNumber)
filter(Sample1 %in% toCheck && Sample2 %in% toCheck )
I keep getting error messages such as:
Error in match.fun(FUN) :
'df$Sample2' is not a function, character or symbol
A pure tidyverse approach would be to use dplyr::filter, dplyr::if_all and stringr::str_split_i, like so:
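library(dplyr, warn.conflicts = FALSE)
library(stringr)

toCheck <- c(100, 200)

df |>
  filter(
    # keep a row only if the test holds for both Sample1 and Sample2
    if_all(
      c(Sample1, Sample2),
      # first piece of each string when split on "_"
      ~ stringr::str_split_i(.x, "_", 1) %in% toCheck
    )
  )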
#> Sample1 Sample2 Sisters Value
#> 1 100_99 200_98 Yes 20
#> 2 200_99 100_98 Yes 20
#> 3 100_99 100_98 Yes 20
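As for why the attempt in the question fails, a likely cause is that %>% passes its left-hand side as the first argument of the next call. Something like df %>% lapply(df$Sample2, f) then becomes lapply(df, df$Sample2, f), so df$Sample2 lands where lapply expects a function, which matches the match.fun(FUN) error. In addition, && is not vectorised, so filter needs & to combine the two tests row by row. Below is a minimal sketch of the same pipeline with those two points addressed. The name first_number is a hypothetical stand-in for the helper mentioned in the question, and the data frame repeats the example data.

```r
library(dplyr, warn.conflicts = FALSE)

toCheck <- c(100, 200)

# Hypothetical stand-in for the question's helper:
# keep the text before the first "_"
first_number <- function(x) sub("_.*$", "", x)

# Example data from the question (Sisters and Value are constant there,
# so the single values are recycled to all rows)
df <- data.frame(
  Sample1 = c("100_99", "101_99", "102_99", "103_99", "104_99", "200_99", "100_99"),
  Sample2 = c("200_98", "200_98", "200_98", "201_98", "201_98", "100_98", "100_98"),
  Sisters = "Yes",
  Value = 20L
)

# sub() is vectorised, so no lapply() is needed;
# `&` combines the two membership tests element by element
qq <- df %>%
  filter(first_number(Sample1) %in% toCheck & first_number(Sample2) %in% toCheck)
```

Because the helper is called inside filter() rather than assigned back to the columns, the original values such as 100_99 stay in the result.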
DATA
df <- data.frame(
  stringsAsFactors = FALSE,
  Sample1 = c(
    "100_99", "101_99", "102_99",
    "103_99", "104_99", "200_99", "100_99"
  ),
  Sample2 = c(
    "200_98", "200_98", "200_98",
    "201_98", "201_98", "100_98", "100_98"
  ),
  Sisters = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
  Value = c(20L, 20L, 20L, 20L, 20L, 20L, 20L)
)
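To see what each of the three steps from the question produces, they can also be traced one at a time in base R. This is only an illustrative sketch: it uses three rows taken from the example data, and the intermediate names (prefix1, in1, keep and so on) are made up for the illustration.

```r
toCheck <- c(100, 200)

# Three rows from the example data (rows 1, 2 and 6)
sample1 <- c("100_99", "101_99", "200_99")
sample2 <- c("200_98", "200_98", "100_98")

# Step 1: keep the part of each string before the first "_"
prefix1 <- sub("_.*$", "", sample1)
prefix2 <- sub("_.*$", "", sample2)

# Step 2: membership test; %in% compares the numbers in toCheck
# with the character prefixes by converting them to strings
in1 <- prefix1 %in% toCheck
in2 <- prefix2 %in% toCheck

# Step 3: a row is kept only when both tests are TRUE
keep <- in1 & in2   # TRUE, FALSE, TRUE for these three rows
print(keep)
```

A logical vector like keep can then be used to subset the rows of a data frame, for example df[keep, ] when it has one element per row of df.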