Search code examples
rdataframeconditional-statementsdata-manipulation

Prune list of pairs in R dataframe by removing redundancies based on genomic distance and p-value


I have an R dataframe containing pairs of SNPs (with chromosome number and base pair location in separate columns) and p-values for each pair. The data looks like this:

chr_snp1     loc_snp1     chr_snp2     loc_snp2     pval
1            278474050    2            57386477     7.43e-08
1            275620856    2            57386662     1.08e-07
1            144075771    3            109909704    1.02e-06
1            144075771    3            111344453    2.06e-06
2            103701229    7            56369738     3.83e-08
2            102990566    7            56407691     1.07e-07

I want to remove redundancies in pairs. However, redundancy being defined as a similar pair that is close to another determined by the values in the loc_snp1 and loc_snp2 columns as well as chr_snp1 and chr_snp2. The numeric limit I want to impose for values in loc_snp1 and loc_snp2 is 10,000,000 for pairs on the same combination of chromosomes. This data is sorted by chromosome (chr_snp1 and chr_snp2), then base pair location (loc_snp1 and loc_snp2), then by p-value (pval).

Basically I want the script to check if the value in loc_snp1 is within 10,000,000 of the value in loc_snp1 in the row above and if they both have the same value in chr_snp1. Then, also check if the value in loc_snp2 is within 10,000,000 of the value in loc_snp2 in the row above and if they both have the same value in chr_snp2. If all four conditions are true, only keep the row with the lowest pval and discard the other.

In order words, only keep the best pair and remove all the others that are close (have the same chromosome combination and within 10,000,000 base pairs of other pairs on their respective chromosomes).

This would result in the dataframe looking like:

chr_snp1     loc_snp1     chr_snp2     loc_snp2     pval
1            278474050    2            57386477     7.43e-08
1            144075771    3            109909704    1.02e-06
2            103701229    7            56369738     3.83e-08

Of course, the script doesn't have to actually check the row above. I assume there are more elegant ways to approach the problem.


Solution

  • Here is option. If you already know chr_snp1 and chr_snp2 need to be identical, you can start by grouping them and then calculating the the differences. Admittedly, this might fail if you have a duplicate chr_snp1 somewhere else in the dataset.

    library(tidyverse)
    
    df |>
      group_by(chr_snp1, chr_snp2) |>
      mutate(grp = (abs(loc_snp1 - lag(loc_snp1, default = first(loc_snp1))) < 10e6) &
               (abs(loc_snp2 - lag(loc_snp2, default = first(loc_snp2))) < 10e6)) |>
      group_by(grp, .add=TRUE)  |>
      filter(pval == min(pval)) |>
      ungroup()|>
      select(-grp)
    #> # A tibble: 3 x 5
    #>   chr_snp1  loc_snp1 chr_snp2  loc_snp2         pval
    #>      <dbl>     <dbl>    <dbl>     <dbl>        <dbl>
    #> 1        1 278474050        2  57386477 0.0000000743
    #> 2        1 144075771        3 109909704 0.00000102  
    #> 3        2 103701229        7  56369738 0.0000000383