map gene positions to chromosome coordinates


First post here, so I hope I can explain myself clearly.

I need to cross-reference two data frames: for each chromosome location given in one data frame, I want to check whether it falls within a range provided by the other, and add a new column with the gene found in that range.

"genes" is the data frame with the coordinates (start/end) that define the ranges:

head(genes)
# A tibble: 6 x 9
  chr   source         type      start       end strand gene_id         symbol        gene_biotype  
  <chr> <chr>          <chr>     <int>     <int> <chr>  <chr>           <chr>         <chr>         
1 2     pseudogene     gene  143300987 143301544 +      ENSG00000228134 AC092578.1    pseudogene    
2 2     pseudogene     gene  143611664 143613567 +      ENSG00000229781 AC013444.1    pseudogene    
3 2     protein_coding gene  143635067 143799890 +      ENSG00000115919 KYNU          protein_coding
4 2     pseudogene     gene  143704869 143705655 -      ENSG00000270390 RP11-470B22.1 pseudogene    
5 2     miRNA          gene  143763269 143763360 -      ENSG00000221169 AC013444.2    miRNA         
6 2     protein_coding gene  143848931 144525921 +      ENSG00000075884 ARHGAP15      protein_coding

The other data frame (x) is:

  chr_a   point A
1     2 143301002 
2     2 143625061
3     2 143700941
4     2 143811317
5     2 144127323
6     2 144224689

I basically have to find whether "point A" falls within the "start"/"end" range in genes, and which gene symbol is associated with it.

I tried the following:

x$geneA <- ifelse(sapply(x$`point A`, function(g)
  any(genes$start >= g & genes$end <=g)), genes$symbol, NA)

but the results I get do not agree with the genomic coordinates.

Hope someone can help me! Thx!


Solution

  • Does this work?

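    I'm assuming that each point matches only one gene symbol; if several genes overlap a point, only the first one is kept. Your attempt had the comparisons reversed (genes$start >= g & genes$end <= g asks for a gene that starts after the point and ends before it), and ifelse() then picked genes$symbol by row position rather than by the matching range.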

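    library(dplyr)  # provides filter()

    # For each point, keep the genes whose [start, end] range contains it and take the first symbol
    x$geneA <- sapply(x$`point A`,
                      function(g) filter(genes, g >= start & g <= end)$symbol[1])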
    

    Result:

    x
    
    # A tibble: 6 x 3
      chr_a `point A` geneA     
      <int>     <int> <chr>     
    1     2 143301002 AC092578.1
    2     2 143625061 NA        
    3     2 143700941 KYNU      
    4     2 143811317 NA        
    5     2 144127323 ARHGAP15  
    6     2 144224689 ARHGAP15
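
The lookup above compares positions only, which is fine here because every row is on chromosome 2. As a minimal sketch for data with several chromosomes or overlapping genes, the base R version below also compares the chromosome and returns every overlapping symbol. The helper name genes_at and the ";" separator are assumptions made for illustration; they are not part of the original answer. The data are copied from the question, keeping only the columns needed.

```r
# Data copied from the question (only the columns used by the lookup).
genes <- data.frame(
  chr    = c("2", "2", "2", "2", "2", "2"),
  start  = c(143300987L, 143611664L, 143635067L, 143704869L, 143763269L, 143848931L),
  end    = c(143301544L, 143613567L, 143799890L, 143705655L, 143763360L, 144525921L),
  symbol = c("AC092578.1", "AC013444.1", "KYNU", "RP11-470B22.1", "AC013444.2", "ARHGAP15"),
  stringsAsFactors = FALSE
)

x <- data.frame(
  chr_a     = c(2L, 2L, 2L, 2L, 2L, 2L),
  `point A` = c(143301002L, 143625061L, 143700941L, 143811317L, 144127323L, 144224689L),
  check.names = FALSE  # keep the space in the column name "point A"
)

# Hypothetical helper: symbols of all genes on chromosome `chr` whose
# [start, end] range contains `pos`, joined with ";", or NA if there are none.
genes_at <- function(chr, pos, genes) {
  # Compare chromosomes as text because genes$chr is character and x$chr_a is integer.
  hit <- as.character(genes$chr) == as.character(chr) &
    genes$start <= pos & pos <= genes$end
  if (!any(hit)) return(NA_character_)
  paste(genes$symbol[hit], collapse = ";")
}

# Apply the helper row by row over (chromosome, position) pairs.
x$geneA <- mapply(genes_at, x$chr_a, x$`point A`,
                  MoreArgs = list(genes = genes), USE.NAMES = FALSE)
print(x)
```

On the six example points this gives the same geneA column as the result shown above. For large tables, a dedicated interval-overlap tool is likely to be faster than a row-by-row scan.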