
R - Using biomaRt/getBM generates a list of 55,000+ genes instead of the ~15,000 I'm inputting as a data frame under "values"


I have downloaded an extensive dataset from NIH GEO and am attempting to convert the Ensembl IDs in the first column to MGI symbols.

The table I've named SOD is shown below

SOD Data - Total rows = 15,396

I used the following code:

setwd("C:/R/Project")
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("biomaRt", version = "3.8")
library(BiocManager)
library(biomaRt)
SOD<-read.csv("Static Organoid Data.csv")
names_only<-data.frame(SOD[,1])
mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
Gene_list <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol"),
                   values     = names_only, 
                   mart       = mart)
View(Gene_list)

This outputs a list of ensembl and MGI symbols with over 55,000 rows.

I have tried adding filter = "ensembl_gene_id" to the getBM call, but then the output has 0 rows and 0 columns.

What am I doing wrong here?


Solution

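  • Your Ensembl IDs are versioned, meaning they end in a .# suffix (a dot followed by a version number), whereas the Ensembl IDs in biomaRt aren't. That is why adding the filter returned nothing: none of the versioned IDs matched. Without a filter, getBM ignores values and returns every gene in the dataset, which is where the 55,000+ rows come from. To fix this, remove the .# at the end of the names and keep the filter, as follows: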

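    # Drop the trailing ".<version>" and pass a plain character vector, not a data frame.
    # (The pattern "\\.*" would only delete the dot itself and keep the version digits.)
    names_only <- sub("\\.[0-9]+$", "", SOD[, 1])
    mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
    Gene_list <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol"),
                       filters    = "ensembl_gene_id",
                       values     = names_only,
                       mart       = mart)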
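
If you want to check the version-stripping step before querying BioMart, here is a minimal base-R sketch. The IDs and version numbers in `ids` are made up for illustration (your real ones come from `SOD[, 1]`), and the variable names are my own. It needs no network access.

```r
# Illustrative IDs only; the version suffixes (.4, .15) are invented for this example.
ids <- c("ENSMUSG00000000001.4",
         "ENSMUSG00000000028.15",
         "ENSMUSG00000000031")   # an ID with no version should pass through unchanged

# Remove a trailing dot followed by digits, anchored at the end of the string.
stripped <- sub("\\.[0-9]+$", "", ids)
print(stripped)

# Sanity checks: no dots remain, and the number of IDs is unchanged.
stopifnot(!any(grepl(".", stripped, fixed = TRUE)))
stopifnot(length(stripped) == length(ids))
```

Anchoring the pattern with `$` means only a version suffix at the very end is removed. After running getBM, comparing `length(unique(names_only))` with `length(unique(Gene_list$ensembl_gene_id))` should tell you how many of your IDs were not found in the current Ensembl release.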