
Split data frame columns on multiple delimiters in R


df1 <- 
     Gene             GeneLocus 
    CPA1|1357       chr7:130020290-130027948:+     
    GUCY2D|3000     chr17:7905988-7923658:+   
    UBC|7316        chr12:125396194-125399577:-            
    C11orf95|65998  chr11:63527365-63536113:-        
    ANKMY2|57037    chr7:16639413-16685398:- 

expected output

df2 <- 
     Gene.1   Gene.2             chr     start     end 
    CPA1      1357               7     130020290 130027948   
    GUCY2D    3000               17      7905988   7923658  
    UBC       7316               12    125396194 125399577          
    C11orf95  65998              11     63527365  63536113     
    ANKMY2    57037               7     16639413  16685398

I tried it this way:

install.packages("splitstackshape")
library(splitstackshape)
df1 <- cSplit(df1, "Gene", sep = "|", direction = "wide", fixed = TRUE)
df1 <- cSplit(df1, "GeneLocus", sep = ":", direction = "wide", fixed = TRUE)
df1 <- cSplit(df1, "GeneLocus_2", sep = "-", direction = "wide", fixed = TRUE)
df1 <- data.frame(df1)
# strip the "chr" prefix from the chromosome column
df1$GeneLocus_1 <- gsub("chr", "", df1$GeneLocus_1)

I would like to know if there is a simpler way to do this.


Solution

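  • Here you go. By default, separate() splits on any run of non-alphanumeric characters, so it handles |, : and - without naming them. You can ignore the warning, which does not affect the output: the trailing strand marker (:+ or :-) leaves an extra piece that gets discarded, which conveniently removes the strand information.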

    library(tidyr)
    library(dplyr)
    df1 %>%
      separate(Gene, c("Gene.1", "Gene.2")) %>%
      separate(GeneLocus, c("chr", "start", "end")) %>%
      mutate(chr = sub("chr", "", chr))
    

    Output:

        Gene.1 Gene.2 chr     start       end
    1     CPA1   1357   7 130020290 130027948
    2   GUCY2D   3000  17   7905988   7923658
    3      UBC   7316  12 125396194 125399577
    4 C11orf95  65998  11  63527365  63536113
    5   ANKMY2  57037   7  16639413  16685398
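
If the warning is a nuisance, a variant of the same approach might look like the sketch below. It rebuilds df1 from the question so it can be run on its own. The explicit sep patterns, extra = "drop" and convert = TRUE are additions that are not part of the original answer; the result name df2 is only borrowed from the question's expected output.

```r
library(tidyr)
library(dplyr)

# Rebuild the example data from the question
df1 <- data.frame(
  Gene = c("CPA1|1357", "GUCY2D|3000", "UBC|7316",
           "C11orf95|65998", "ANKMY2|57037"),
  GeneLocus = c("chr7:130020290-130027948:+", "chr17:7905988-7923658:+",
                "chr12:125396194-125399577:-", "chr11:63527365-63536113:-",
                "chr7:16639413-16685398:-"),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  # split Gene on the literal pipe character
  separate(Gene, c("Gene.1", "Gene.2"), sep = "\\|") %>%
  # split GeneLocus on ":" or "-"; extra = "drop" silently discards the
  # strand piece(s), convert = TRUE turns start and end into integers
  separate(GeneLocus, c("chr", "start", "end"), sep = "[:-]",
           extra = "drop", convert = TRUE) %>%
  # remove the leading "chr" from the chromosome column
  mutate(chr = sub("^chr", "", chr))

print(df2)
```

Here Gene.2 and chr remain character columns; wrap them in as.integer() if numeric values are needed.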