Search code examples
rbioinformaticsgenetics

Add gene names to dataframe, merge 2 dataframes on transcript ID column


I need to add a column containing gene names to a dataframe with information about isoform abundance. I have two tables.

DF 1 (contains ensemble gene ID #s as the rownames and asorted isoform abundance values in about 15 additional columns)

                              event_name              sample1_posterior_mean                        
gene:ENSMUSG00000079523       0.93,0.02,0.00,0.06     0.90,0.01,0.00,0.04
gene:ENSMUSG00000078572       0.78                    0.67
gene:ENSMUSG00000022548       0.63                    0.25

DF 2 (contains 3 columns ensemble gene ID #s and gene names)

          Ensemble_Transcript_ID  Ensemble_Gene_ID     External_Gene_ID
2335      ENSMUST00000101973      ENSMUSG00000096659            Gm25679
2336      ENSMUST00000179019      ENSMUSG00000095915           n-R5s115
2337      ENSMUST00000183908      ENSMUSG00000099299            Gm27722
2338      ENSMUST00000044752      ENSMUSG00000039481               Nrtn
2339      ENSMUST00000179157      ENSMUSG00000095476            Gm25077

I would like to add the External_Gene_ID column from DF 2 to the appropriate Ensemble_Gene_ID column in DF 1. I know there is a way to merge these two data frames together based on a column of interest

I hope i have explained this in sufficient detail. Thank you for your help!


Solution

  • A dplyr solution (maybe) is to first create the Ensemble_Gene_ID column in DF1, and then use dplyr to join based on that column. The left join is such that items from DF2 that match DF1 will become part of the new DF, but those gene ID's unique to DF2 will not be (i.e., DF1 is sort of the template that you care about, such that the output will maintain the original nrow of DF1).

    DF1$Ensemble_Gene_ID <- sapply(rownames(DF1), 
                              function(x){unlist(strsplit(x,':'))[2]})})
    
    DF3 <- dplyr::left_join(DF1, DF2, by=c('Ensemble_Gene_ID'='External_Gene_ID'))
    

    but I'm unsure if this works without an example chunk of each data.frame.