Search code examples
rdplyrreshape2

R - reshape dataframe with multiple elements in the same cell using reshape2 or dplyr


I need to reorganize a huge dataframe based on the values on the 'UCSC_RefGene_Name' and 'UCSC_RefGene_Group' columns, while maintaining each values from the 'ID' column. It would be possible to achieve this using reshape2 or dplyr packages?

df <- data.frame(ID=c('cg00035864','cg00050873', 'cg00061679', 'cg00063477', 'cg00121626', 'cg00212031'),
UCSC_RefGene_Name=c('TTTY18', 'TSPY4;FAM197Y2', 'DAZ1;DAZ4;DAZ5', 'EIF1AY', 'BCORL2', 'TTTY14'),
UCSC_RefGene_Group = c('TSS1500','Body;TSS1500','Body;Body;Body', 'Body', 'Body', 'TSS200'))


> df
ID  UCSC_RefGene_Name   UCSC_RefGene_Group
cg00035864  TTTY18  TSS1500
cg00050873  TSPY4;FAM197Y2  Body;TSS1500
cg00061679  DAZ1;DAZ4;DAZ4  Body;Body;Body
cg00063477  EIF1AY  Body
cg00121626  BCORL2  Body
cg00212031  TTTY14  TSS200

Some cells contain multiple elements separated by the ";" symbol, so I need to create new rows to accommodate these elements. My new DF would be:

new_df <- data.frame(ID=c('cg00035864','cg00050873', 'cg00050873','cg00061679', 'cg00061679','cg00061679','cg00063477', 'cg00121626', 'cg00212031'),
UCSC_RefGene_Name=c('TTTY18', 'TSPY4','FAM197Y2', 'DAZ1','DAZ4','DAZ5', 'EIF1AY', 'BCORL2', 'TTTY14'),
UCSC_RefGene_Group = c('TSS1500','Body','TSS1500','Body','Body','Body', 'Body', 'Body', 'TSS200'))                  

> new_df

ID  UCSC_RefGene_Name   UCSC_RefGene_Group
cg00035864  TTTY18  TSS1500
cg00050873  TSPY4   Body
cg00050873  FAM197Y2    TSS1500
cg00061679  DAZ1    Body
cg00061679  DAZ4    Body
cg00061679  DAZ5    Body
cg00063477  EIF1AY  Body
cg00121626  BCORL2  Body
cg00212031  TTTY14  TSS200

Solution

  • We can apply separate_rows on the two columns

    library(dplyr)
    library(tidyr)
    df %>% 
        separate_rows(UCSC_RefGene_Name, UCSC_RefGene_Group)
    #          ID UCSC_RefGene_Name UCSC_RefGene_Group
    #1 cg00035864            TTTY18            TSS1500
    #2 cg00050873             TSPY4               Body
    #3 cg00050873          FAM197Y2            TSS1500
    #4 cg00061679              DAZ1               Body
    #5 cg00061679              DAZ4               Body
    #6 cg00061679              DAZ5               Body
    #7 cg00063477            EIF1AY               Body
    #8 cg00121626            BCORL2               Body
    #9 cg00212031            TTTY14             TSS200