I need to add a column containing gene names to a dataframe with information about isoform abundance. I have two tables.
DF 1 (contains ensemble gene ID #s as the rownames and asorted isoform abundance values in about 15 additional columns)
event_name sample1_posterior_mean
gene:ENSMUSG00000079523 0.93,0.02,0.00,0.06 0.90,0.01,0.00,0.04
gene:ENSMUSG00000078572 0.78 0.67
gene:ENSMUSG00000022548 0.63 0.25
DF 2 (contains 3 columns ensemble gene ID #s and gene names)
Ensemble_Transcript_ID Ensemble_Gene_ID External_Gene_ID
2335 ENSMUST00000101973 ENSMUSG00000096659 Gm25679
2336 ENSMUST00000179019 ENSMUSG00000095915 n-R5s115
2337 ENSMUST00000183908 ENSMUSG00000099299 Gm27722
2338 ENSMUST00000044752 ENSMUSG00000039481 Nrtn
2339 ENSMUST00000179157 ENSMUSG00000095476 Gm25077
I would like to add the External_Gene_ID column from DF 2 to the appropriate Ensemble_Gene_ID column in DF 1. I know there is a way to merge these two data frames together based on a column of interest
I hope i have explained this in sufficient detail. Thank you for your help!
A dplyr
solution (maybe) is to first create the Ensemble_Gene_ID
column in DF1
, and then use dplyr
to join based on that column. The left join is such that items from DF2 that match DF1 will become part of the new DF, but those gene ID's unique to DF2 will not be (i.e., DF1 is sort of the template that you care about, such that the output will maintain the original nrow of DF1).
DF1$Ensemble_Gene_ID <- sapply(rownames(DF1),
function(x){unlist(strsplit(x,':'))[2]})})
DF3 <- dplyr::left_join(DF1, DF2, by=c('Ensemble_Gene_ID'='External_Gene_ID'))
but I'm unsure if this works without an example chunk of each data.frame.