Search code examples
rsplittidyrseparator

Split multiple columns into multiple columns using r


I have a txt file which contains 20 columns and 300 rows. The sample of my data is given below.

id  sub     A1                      A2      B1           B2                    C1   
96  AAA 01:01:01:01/01:01:01:02N        29:02:01    08:01:01/08:19N 44:03:01/44:03:03/44:03:04  07:01:01/07:01:02
97  AAA 03:01:01:01/03:01:01:02N        30:08:01    09:02:01/08:19N 44:03:01/44:03:03/44:03:04  07:01:01/07:01:02
98 AAA  01:01:01:01/01:01:01:02N/01:22N 29:02:01    08:01:01/08:19N 44:03:01/44:03:03/44:03:04  07:09:01/07:01:02
99  AAA 03:01:01:01                     30:08:01    09:02:01/08:19N 44:03:01/44:03:03/44:03:04  07:08:01/07:01:02 

I need to seperate the columns (A1,A2,B1....) with the seperator "/" using r. The output would be:

   id   sub A1_1      A1_2         A2       B1_1     B1_2    B2_1  B2_2   ..
96  AAA 01:01:01:01   01:01:01:02N      29:02:01    08:01:01     08:19N      44:03:01  44:03:03   44:03:04  ...

I could find functions to split one columns into multiple columns. But I could not find a solution to achieve this.


Solution

  • I suggest a reshape2 solution taking care of not knowing the number of parts:

    > dput(pz1)
    structure(list(id = c("HG00096", "HG00097", "HG00098", "HG00099"
    ), sub = c("GBR", "GBR", "GBR", "GBR"), HLA_A1 = c("01:01:01:01/01:01:01:02N", 
    "03:01:01:01/03:01:01:02N", "01:01:01:01/01:01:01:02N/01:22N", 
    "03:01:01:01"), HLA_A2 = c("29:02:01", "30:08:01", "29:02:01", 
    "30:08:01"), HLA_B1 = c("08:01:01/08:19N", "09:02:01/08:19N", 
    "08:01:01/08:19N", "09:02:01/08:19N"), HLA_B2 = c("44:03:01/44:03:03/44:03:04", 
    "44:03:01/44:03:03/44:03:04", "44:03:01/44:03:03/44:03:04", "44:03:01/44:03:03/44:03:04"
    ), HLA_C1 = c("07:01:01/07:01:02", "07:01:01/07:01:02", "07:09:01/07:01:02", 
    "07:08:01/07:01:02")), .Names = c("id", "sub", "HLA_A1", "HLA_A2", 
    "HLA_B1", "HLA_B2", "HLA_C1"), row.names = c(NA, -4L), class = "data.frame")
    

    add this function:

    library("reshape2", lib.loc="~/R/win-library/3.3")
    
    getIt <- function(df,col) {    
    x <- max(sapply(strsplit(df[,col],split="/"),length))   ### get the max parts for column
    q <- colsplit(string = df[,col],pattern="/",names = paste0(names(df)[col],"_",LETTERS[1:x]))
    return(q) }
    

    after you have this function you can easily do:

    > getIt(pz1,3)
         HLA_A1_A     HLA_A1_B HLA_A1_C
    1 01:01:01:01 01:01:01:02N         
    2 03:01:01:01 03:01:01:02N         
    3 01:01:01:01 01:01:01:02N   01:22N
    4 03:01:01:01                      
    

    and a simple cbind with the original dataframe (with or without the original columns) :

    > cbind(pz1[,1:2],getIt(pz1,3),getIt(pz1,4),getIt(pz1,5),getIt(pz1,6))
           id sub    HLA_A1_A     HLA_A1_B HLA_A1_C HLA_A2_A HLA_B1_A HLA_B1_B HLA_B2_A HLA_B2_B HLA_B2_C
    1 HG00096 GBR 01:01:01:01 01:01:01:02N          29:02:01 08:01:01   08:19N 44:03:01 44:03:03 44:03:04
    2 HG00097 GBR 03:01:01:01 03:01:01:02N          30:08:01 09:02:01   08:19N 44:03:01 44:03:03 44:03:04
    3 HG00098 GBR 01:01:01:01 01:01:01:02N   01:22N 29:02:01 08:01:01   08:19N 44:03:01 44:03:03 44:03:04
    4 HG00099 GBR 03:01:01:01                       30:08:01 09:02:01   08:19N 44:03:01 44:03:03 44:03:04