I try to get the gene names out of a binding analysis of the 5'UTR. Therefore I have this little code. Until the vmatchPattern
everything works fine. At least I hope so.
library(biomaRt)
library(GenomicFeatures)
library(XVector)
library(Biostrings)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(BSgenome.Mmusculus.UCSC.mm10)
fUTR <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene)
Mmusculus <- BSgenome.Mmusculus.UCSC.mm10
seqlevelsStyle(Mmusculus) <- 'ensembl'
seqlevelsStyle(fUTR) <- 'ensembl'
Seq <- getSeq(Mmusculus, fUTR)
Pbind <- RNAString('UGUGUGAAHAA')
Match <- vmatchPattern(Pbind, unlist2(Seq), max.mismatch = 0, min.mismatch = 0, with.indels = F, fixed = T, algorithm = 'auto')
Afterwards however I want to get the gene names to create a list in the end and use this in Python for further analysis of a RNAseq experiment. There comes a problem, I think I found so far three different ways on how to potentially do this. However none of them are working for me.
##How to get gene names from the match Pattern
#1
matches <- unlist(Match, recursive = T, use.names = T)
m <- as.matrix(matches)
subseq(genes[rownames(m),], start = m[rownames(m),1], width = 20)
#2
transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c('tx_id', 'tx_name', 'gene_id'))
#3
count_index <- countIndex(Match)
wh <- which(count_index > 0)
result_list = list()
for(i in 1: length(wh))
{
result_list[[i]] = Views(subject[[wh[i]]], mindex[[wh[i]]])
}
names(result_listF) = nm[wh]
I am happy to hear some suggestions and get some help or solution for this problem. I am no Bioinformation by training, so this took me already quite a while to figure this out.
So I found an answer, I hope this helps someone, and there is no mistake somewhere.
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
##get all 5’ UTR sequences
fUTR <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene)
utr_ul <- unlist(fUTR, use.names = F)
mcols(utr_ul)$tx_id <- rep(as.integer(names(fUTR)), lengths(fUTR))
utr_ul
tx2gene <- mcols(transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c('tx_id', 'tx_name', 'gene_id')))
tx2gene$gene_id <- as.character(tx2gene$gene_id)
m <- match(mcols(utr_ul)$tx_id, tx2gene$tx_id)
mcols(utr_ul) <- cbind(mcols(utr_ul), tx2gene[m, -1L, drop = F])
utr5_by_gene <- split(utr_ul, mcols(utr_ul)$gene_id)
seqs <- getSeq(Mmusculus, utr5_by_gene)
##search with motif UGUGUGAAHAA
motif <- DNAString('TGTGTGAAHAA')
x <- vmatchPattern(motif, unlist(seqs), fixed = F)
matches <- unlist(x, recursive = T, use.names = T)
##list all genes with matches
hits <- mapIds(org.Mm.eg.db, keys = unique(names(matches)), keytype = 'ENTREZID',
column = 'SYMBOL', multiVals = 'first')