Search code examples
rdataframeconditional-statementsdata-manipulation

Generating True/False column in one dataframe based on multiple conditions in another dataframe in R


I have two dataframes of different length. The first looks like this and is the dataframe I want to add the True/False column to:

chr_snp    loc_snp     ma_snp
1          184319928   T
1          276998062   A
1          278255864   G
2          243012470   G
2          123072103   T
3          526785124   A

The second data frame is the reference dataframe that is smaller:

chr_QTL    loc_QTL     ma_QTL
1          281788173   G   
1          203085725   C
2          241577141   C

For each row in dataframe 1 (df1), I want to first check if the value of df1$chr_snp matches a value in df2$chr_QTL. If this match is true, then I want to determine if the value in df1$loc_snp is within 10 million units (these are DNA base-pairs) above OR below any values based on the first condition in df2$loc_QTL. Now, what is tricky is that for the first three rows of df1, there are three possible row matches in df2 (rows 1 and 2) based on the first criteria alone. However, only two match based on the second criteria (10M base-pairs greater than OR less than value in df2$loc_QTL). Note: df1$ma_snp and df2$ma_QTL can be totally ignored. So, based on these criteria, df1 should now look like:

chr_snp    loc_snp     ma_snp   Match
1          184319928   T        FALSE
1          276998062   A        TRUE
1          278255864   G        TRUE
2          243012470   G        TRUE
2          123072103   T        FALSE
3          526785124   A        FALSE

Solution

  • Here is one option. Join the data then look for any instance of < 10,000,000.

    library(tidyverse)
    
    left_join(df1 |>
                mutate(rw_id = row_number()), 
              df2, by = c("chr_snp" = "chr_QTL"))  |>
      mutate(less = abs(loc_snp -loc_QTL) < 10e6) |>
      group_by(rw_id)|>
      summarise(across(contains(colnames(df1)), ~.[[1]]),
                Match = any(less),
                Match = ifelse(is.na(Match), FALSE, Match))
    #> # A tibble: 6 x 5
    #>   rw_id chr_snp   loc_snp ma_snp Match
    #>   <int>   <dbl>     <dbl> <chr>  <lgl>
    #> 1     1       1 184319928 T      FALSE
    #> 2     2       1 276998062 A      TRUE 
    #> 3     3       1 278255864 G      TRUE 
    #> 4     4       2 243012470 G      TRUE 
    #> 5     5       2 123072103 T      FALSE
    #> 6     6       3 526785124 A      FALSE
    

    or another option:

    library(tidyverse)
    
    df1 |>
      mutate(Match = map2_lgl(chr_snp, loc_snp, 
                              \(x, y){
                                (x %in% df2$chr_QTL) && 
                                  any(abs(df2[df2$chr_QTL == x, 'loc_QTL']-y) < 10e6)
                              }))
    #> # A tibble: 6 x 4
    #>   chr_snp   loc_snp ma_snp Match
    #>     <dbl>     <dbl> <chr>  <lgl>
    #> 1       1 184319928 T      FALSE
    #> 2       1 276998062 A      TRUE 
    #> 3       1 278255864 G      TRUE 
    #> 4       2 243012470 G      TRUE 
    #> 5       2 123072103 T      FALSE
    #> 6       3 526785124 A      FALSE
    

    just for fun. Here is a third option with base:

    df1$Match <- apply(outer(df1$loc_snp, df2$loc_QTL, \(x,y) abs(x-y) < 10e6) &
     outer(df1$chr_snp, df2$chr_QTL, `==`), 1, any)
      
    df1
    #> # A tibble: 6 x 4
    #>   chr_snp   loc_snp ma_snp Match
    #>     <dbl>     <dbl> <chr>  <lgl>
    #> 1       1 184319928 T      FALSE
    #> 2       1 276998062 A      TRUE 
    #> 3       1 278255864 G      TRUE 
    #> 4       2 243012470 G      TRUE 
    #> 5       2 123072103 T      FALSE
    #> 6       3 526785124 A      FALSE