
Annotate positions using biomaRt


I have some genome positions and I want to annotate them against Ensembl using the biomaRt R package: find the Ensembl gene ID and the feature type (exonic, intronic, ...) for each position.

Part of my data:

  chr       start        stop     strand
chr10   100572320   100572373          -   
chr10   100572649   100572658          +   

Solution

  • Prepare your data to query biomaRt

    Sample data:

    data = data.frame(chr = "chr17", start = 63973115, end = 64437414)
    data$query = paste(gsub("chr",'',data$chr),data$start,data$end, sep = ":")
    
    #> data
    #    chr    start      end                query
    #1 chr17 63973115 64437414 17:63973115:64437414
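
The same step can be applied to the table from the question, which has a `stop` column and a `+`/`-` strand column. The sketch below is one way to do it, not the answer author's code: `make_region_query` is a hypothetical helper, and the optional fourth field assumes the `chromosomal_region` filter accepts `chr:start:end:strand` with strand written as `1` or `-1`. It also formats coordinates explicitly, because `paste()` can render a round number such as 100000000 as `1e+08`.

```r
# Rows copied from the question; strand is given as "+" or "-".
regions <- data.frame(
  chr    = c("chr10", "chr10"),
  start  = c(100572320, 100572649),
  stop   = c(100572373, 100572658),
  strand = c("-", "+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of biomaRt): build region query strings.
make_region_query <- function(chr, start, end, strand = NULL) {
  # Remove only a leading "chr"; Ensembl names chromosomes "10", "X", ...
  chr <- sub("^chr", "", chr)
  # Force fixed notation so large coordinates never become "1e+08"
  start <- format(start, scientific = FALSE, trim = TRUE)
  end   <- format(end, scientific = FALSE, trim = TRUE)
  query <- paste(chr, start, end, sep = ":")
  if (!is.null(strand)) {
    # Assumption: the filter encodes strand as 1 (forward) or -1 (reverse)
    query <- paste(query, ifelse(strand == "-", "-1", "1"), sep = ":")
  }
  query
}

regions$query <- make_region_query(regions$chr, regions$start,
                                   regions$stop, regions$strand)
```

Leave out the `strand` argument to get plain `chr:start:end` strings like the ones in the sample data above.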
    

    Then use biomaRt

    library(biomaRt)
    
    # Select the dataset for your organism; this example uses the
    # human dataset identifier. Once you are connected to a mart,
    # listDatasets(mart) shows every available dataset.
    
    mart = useMart(
             'ENSEMBL_MART_ENSEMBL',
             host = 'https://www.ensembl.org',
             dataset = 'hsapiens_gene_ensembl')
    
    # listAttributes(mart) lists every field that getBM() can return
    
    out = getBM(
            attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 
                           'ensembl_transcript_id', 'ensembl_exon_id'), 
            filters = 'chromosomal_region', 
            values = data$query, 
            mart = mart)
    

    This returns the Ensembl IDs of the genes, transcripts, and exons located in the given genomic region. biomaRt offers a lot more information, so use listAttributes(mart) to see everything that is available.
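
The IDs alone do not say whether a position is exonic or intronic, which the question also asks for. A rough sketch of one way to derive that label follows. It assumes the query also requested the attributes `chromosome_name`, `exon_chrom_start`, and `exon_chrom_end` (check the exact names with `listAttributes(mart)`). `classify_region` is a hypothetical helper, and the table is a toy stand-in with invented IDs and coordinates, not real Ensembl output. The gene span is approximated by the outermost exon coordinates, and strand is ignored.

```r
# Toy stand-in for a getBM() result; IDs and coordinates are invented.
exons <- data.frame(
  ensembl_gene_id  = c("GENE_A", "GENE_A", "GENE_B"),
  chromosome_name  = c("10", "10", "10"),
  exon_chrom_start = c(100, 500, 2000),
  exon_chrom_end   = c(200, 600, 2100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label one region as exonic, intronic, or intergenic.
classify_region <- function(chr, start, end, exons) {
  ex <- exons[exons$chromosome_name == sub("^chr", "", chr), ]
  if (nrow(ex) == 0) return("intergenic")
  # Two closed intervals overlap when each starts before the other ends
  if (any(ex$exon_chrom_start <= end & ex$exon_chrom_end >= start)) {
    return("exonic")
  }
  # Approximate each gene span by its first exon start and last exon end
  gene_start <- tapply(ex$exon_chrom_start, ex$ensembl_gene_id, min)
  gene_end   <- tapply(ex$exon_chrom_end, ex$ensembl_gene_id, max)
  if (any(gene_start <= end & gene_end >= start)) return("intronic")
  "intergenic"
}

labels <- c(
  classify_region("chr10", 150, 160, exons),  # inside an exon of GENE_A
  classify_region("chr10", 300, 400, exons),  # between the exons of GENE_A
  classify_region("chr10", 900, 950, exons)   # outside both genes
)
```

For many regions, interval tools such as the GenomicRanges package are likely a better fit than this loop-free but per-region check.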