Search code examples
rbioinformaticsbioconductorbiomart

Query genes within regions


I want to retrieve the genes that are present within a series of regions. Say, I have a bed file with query positions such like:

1     2665697     4665777      MIR201
1    10391435    12391516      MIR500
1    15106831    17106911      MIR122
1    23436535    25436616      MIR234 
1    23436575    25436656      MIR488

I would like to get the genes that fall within those regions.

I have tried using biomaRt, and bedtools intersect, but the output I get, is a list of genes corresponding to all the regions, not one by one, as the desired output I would like to get would be the genes within each row, but in separate rows, a if I did one query region at a time. Basically I want to know what genes fall within each region, but still being able to identify which genes fall in which regions.

What I am doing is, from a region of detected miRNA, I am expanding the genome region upwards and downwards, so that I get the neighboring genes from this miRNA. I am using a 1 million bases windows up and down. This would work for just one query, but, how to do many queries with biomaRt or many intersections with bedtools, so that I get somewhat like:

1     2665697     4665777      MIR201      GENEX, GENEY, GENEZ...
1    10391435    12391516      MIR500      GENEA, GENEB, GENEC...
1    15106831    17106911      MIR122
1    23436535    25436616      MIR234 
1    23436575    25436656      MIR488

Meaning that GENEX, GENEY and GENEZ fall within 1:2665697-4665777, with MIR201, placed in the middle, as this region is calculated subtracting 1 million bp to sart, and adding 1 million bp to end position.

I am somewhat determining the neighboring genes from each miRNA, to compare within species, but I do not get how to query multiple regions individually using biomaRt or bedtools.

Any help?


Solution

  • Same approach as @Jimbou without tidyverse:

    library(biomaRt)
    
    # data
    d <- read.table(text = "1     2665697     4665777      MIR201
    1    10391435    12391516      MIR500
    1    15106831    17106911      MIR122
    1    23436535    25436616      MIR234 
    1    23436575    25436656      MIR488")
    
    # specify the database
    ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    
    # loop through rows, get genes, then paste with collapse,
    # and finally bind back with data d.
    res <- cbind(
      d,
      genes = apply(d, 1, function(i){
        x <- getBM(attributes=c("external_gene_name"), 
                   filters = c("chromosome_name" , "start", "end"), 
                   values = list(i[1], i[2], i[3]), 
                   mart = ensembl)
        # keeping only 3 genes, as output is too long.
        # In your case remove below line
        x <- head(x, 3)
    
        # return genes, comma separated
        paste(x$external_gene_name, collapse = ",")
      })
    )
    
    res
    #   V1       V2       V3     V4                       genes
    # 1  1  2665697  4665777 MIR201 TTC34,AC242022.1,AL592464.2
    # 2  1 10391435 12391516 MIR500   AL139424.2,PGD,AL139424.1
    # 3  1 15106831 17106911 MIR122      KAZN,TMEM51-AS1,TMEM51
    # 4  1 23436535 25436616 MIR234       ASAP3,E2F2,AL021154.1
    # 5  1 23436575 25436656 MIR488       ASAP3,E2F2,AL021154.1