Search code examples
rgenomeoverlapping-matches

If position from File B is found in interval of file A, print to new dataframe


I have two files.

File A with intervals (regions on the genome)

  chr  startpos    endpos nMajor nMinor
1   1    762273 120612006      1      0
2   1 144854594 187610698      2      1
3   1 193051685 249120684      1      1
4   2     45895 242836535      1      1
5   3    361508 197566254      1      1
6   4     86022 190862182      1      1

File B with positions (of mutations)

     mutation_id chr    start      end ref_counts var_counts
1  1_3649563_G/T   1  3649563  3649564        551        159
2  1_6196895_G/C   1  6196895  6196895         85         30
3 1_10678395_C/T   1 10678395 10678395        274         60
4 1_11090913_G/C   1 11090913 11090914         70         41
5 1_11772423_G/A   1 11772423 11772423        146         55
6 1_12316528_C/G   1 12316528 12316528        110         88

Now I want to combine the two files to File C that adds the info of nMajor and nMinor from FileA to FileB if the position falls into the according interval.

So I will need to first check if the chromosome is the same and then if the start and end position in FileB are within the interval of FileA.

My output should be: File C

 mutation_id chr    start      end   ref_counts   var_counts  nMajor  nMinor
1  1_3649563_G/T   1  3649563  3649563        551        159    1      0
2  1_6196895_G/C   1  6196895  6196895         85         30    1      0
3 1_10678395_C/T   1 10678395 10678395        274         60    1      0
4 1_11090913_G/C   1 11090913 11090913         70         41    1      0
5 1_11772423_G/A   1 11772423 11772423        146         55    1      0
6 1_12316528_C/G   1 12316528 12316528        110         88    1      0

For lines of FileB that cannot be found in an interval of FileB, I want to print "X" as a placeholder.


Solution

  • You could use a fuzzyjoin for this task:

    library(dplyr)
    library(fuzzyjoin)
    
    file_b %>% 
      fuzzy_left_join(file_a,
                      by = c("chr" = "chr",
                             "start" = "startpos",
                             "end" = "endpos",
                             "start" = "endpos", 
                             "end" = "startpos"),
                      match_fun = list(`==`, `>`, `<`, `<`, `>`)) %>% 
      select(-startpos, -endpos, -chr.y) %>% 
      rename(chr = chr.x)
    

    I didn't create an X for a non-match, because this breaks the class of columns nMajor, nMinor and converts them into a string/character. I don't think it's a good idea here and NA values can easily be handled.

    This returns

    # A tibble: 7 x 8
      mutation_id      chr    start      end ref_counts var_counts nMajor nMinor
      <chr>          <dbl>    <dbl>    <dbl>      <dbl>      <dbl>  <dbl>  <dbl>
    1 1_3649563_G/T      1  3649563  3649564        551        159      1      0
    2 1_6196895_G/C      1  6196895  6196895         85         30      1      0
    3 1_10678395_C/T     1 10678395 10678395        274         60      1      0
    4 1_11090913_G/C     1 11090913 11090914         70         41      1      0
    5 1_11772423_G/A     1 11772423 11772423        146         55      1      0
    6 1_12316528_C/G     1 12316528 12316528        110         88      1      0
    7 ABC                2      123      456          2          3     NA     NA
    

    Data

    file_a <- structure(list(chr = c(1, 1, 1, 2, 3, 4), startpos = c(762273, 
    144854594, 193051685, 45895, 361508, 86022), endpos = c(120612006, 
    187610698, 249120684, 242836535, 197566254, 190862182), nMajor = c(1, 
    2, 1, 1, 1, 1), nMinor = c(0, 1, 1, 1, 1, 1)), row.names = c(NA, 
    -6L), class = c("tbl_df", "tbl", "data.frame"))
    

    In file_b I added a fake dataset to provoke a non-match:

    file_b <- structure(list(mutation_id = c("1_3649563_G/T", "1_6196895_G/C", 
    "1_10678395_C/T", "1_11090913_G/C", "1_11772423_G/A", "1_12316528_C/G", 
    "ABC"), chr = c(1, 1, 1, 1, 1, 1, 2), start = c(3649563, 6196895, 
    10678395, 11090913, 11772423, 12316528, 123), end = c(3649564, 
    6196895, 10678395, 11090914, 11772423, 12316528, 456), ref_counts = c(551, 
    85, 274, 70, 146, 110, 2), var_counts = c(159, 30, 60, 41, 55, 
    88, 3)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
    ))