I need to reorganize a huge dataframe based on the values on the 'UCSC_RefGene_Name' and 'UCSC_RefGene_Group' columns, while maintaining each values from the 'ID' column. It would be possible to achieve this using reshape2 or dplyr packages?
df <- data.frame(ID=c('cg00035864','cg00050873', 'cg00061679', 'cg00063477', 'cg00121626', 'cg00212031'),
UCSC_RefGene_Name=c('TTTY18', 'TSPY4;FAM197Y2', 'DAZ1;DAZ4;DAZ5', 'EIF1AY', 'BCORL2', 'TTTY14'),
UCSC_RefGene_Group = c('TSS1500','Body;TSS1500','Body;Body;Body', 'Body', 'Body', 'TSS200'))
> df
ID UCSC_RefGene_Name UCSC_RefGene_Group
cg00035864 TTTY18 TSS1500
cg00050873 TSPY4;FAM197Y2 Body;TSS1500
cg00061679 DAZ1;DAZ4;DAZ4 Body;Body;Body
cg00063477 EIF1AY Body
cg00121626 BCORL2 Body
cg00212031 TTTY14 TSS200
Some cells contain multiple elements separated by the ";" symbol, so I need to create new rows to accommodate these elements. My new DF would be:
new_df <- data.frame(ID=c('cg00035864','cg00050873', 'cg00050873','cg00061679', 'cg00061679','cg00061679','cg00063477', 'cg00121626', 'cg00212031'),
UCSC_RefGene_Name=c('TTTY18', 'TSPY4','FAM197Y2', 'DAZ1','DAZ4','DAZ5', 'EIF1AY', 'BCORL2', 'TTTY14'),
UCSC_RefGene_Group = c('TSS1500','Body','TSS1500','Body','Body','Body', 'Body', 'Body', 'TSS200'))
> new_df
ID UCSC_RefGene_Name UCSC_RefGene_Group
cg00035864 TTTY18 TSS1500
cg00050873 TSPY4 Body
cg00050873 FAM197Y2 TSS1500
cg00061679 DAZ1 Body
cg00061679 DAZ4 Body
cg00061679 DAZ5 Body
cg00063477 EIF1AY Body
cg00121626 BCORL2 Body
cg00212031 TTTY14 TSS200
We can apply separate_rows
on the two columns
library(dplyr)
library(tidyr)
df %>%
separate_rows(UCSC_RefGene_Name, UCSC_RefGene_Group)
# ID UCSC_RefGene_Name UCSC_RefGene_Group
#1 cg00035864 TTTY18 TSS1500
#2 cg00050873 TSPY4 Body
#3 cg00050873 FAM197Y2 TSS1500
#4 cg00061679 DAZ1 Body
#5 cg00061679 DAZ4 Body
#6 cg00061679 DAZ5 Body
#7 cg00063477 EIF1AY Body
#8 cg00121626 BCORL2 Body
#9 cg00212031 TTTY14 TSS200