I have a txt file which contains 20 columns and 300 rows. The sample of my data is given below.
id sub A1 A2 B1 B2 C1
96 AAA 01:01:01:01/01:01:01:02N 29:02:01 08:01:01/08:19N 44:03:01/44:03:03/44:03:04 07:01:01/07:01:02
97 AAA 03:01:01:01/03:01:01:02N 30:08:01 09:02:01/08:19N 44:03:01/44:03:03/44:03:04 07:01:01/07:01:02
98 AAA 01:01:01:01/01:01:01:02N/01:22N 29:02:01 08:01:01/08:19N 44:03:01/44:03:03/44:03:04 07:09:01/07:01:02
99 AAA 03:01:01:01 30:08:01 09:02:01/08:19N 44:03:01/44:03:03/44:03:04 07:08:01/07:01:02
I need to seperate the columns (A1,A2,B1....) with the seperator "/" using r. The output would be:
id sub A1_1 A1_2 A2 B1_1 B1_2 B2_1 B2_2 ..
96 AAA 01:01:01:01 01:01:01:02N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04 ...
I could find functions to split one columns into multiple columns. But I could not find a solution to achieve this.
I suggest a reshape2 solution taking care of not knowing the number of parts:
> dput(pz1)
structure(list(id = c("HG00096", "HG00097", "HG00098", "HG00099"
), sub = c("GBR", "GBR", "GBR", "GBR"), HLA_A1 = c("01:01:01:01/01:01:01:02N",
"03:01:01:01/03:01:01:02N", "01:01:01:01/01:01:01:02N/01:22N",
"03:01:01:01"), HLA_A2 = c("29:02:01", "30:08:01", "29:02:01",
"30:08:01"), HLA_B1 = c("08:01:01/08:19N", "09:02:01/08:19N",
"08:01:01/08:19N", "09:02:01/08:19N"), HLA_B2 = c("44:03:01/44:03:03/44:03:04",
"44:03:01/44:03:03/44:03:04", "44:03:01/44:03:03/44:03:04", "44:03:01/44:03:03/44:03:04"
), HLA_C1 = c("07:01:01/07:01:02", "07:01:01/07:01:02", "07:09:01/07:01:02",
"07:08:01/07:01:02")), .Names = c("id", "sub", "HLA_A1", "HLA_A2",
"HLA_B1", "HLA_B2", "HLA_C1"), row.names = c(NA, -4L), class = "data.frame")
add this function:
library("reshape2", lib.loc="~/R/win-library/3.3")
getIt <- function(df,col) {
x <- max(sapply(strsplit(df[,col],split="/"),length)) ### get the max parts for column
q <- colsplit(string = df[,col],pattern="/",names = paste0(names(df)[col],"_",LETTERS[1:x]))
return(q) }
after you have this function you can easily do:
> getIt(pz1,3)
HLA_A1_A HLA_A1_B HLA_A1_C
1 01:01:01:01 01:01:01:02N
2 03:01:01:01 03:01:01:02N
3 01:01:01:01 01:01:01:02N 01:22N
4 03:01:01:01
and a simple cbind with the original dataframe (with or without the original columns) :
> cbind(pz1[,1:2],getIt(pz1,3),getIt(pz1,4),getIt(pz1,5),getIt(pz1,6))
id sub HLA_A1_A HLA_A1_B HLA_A1_C HLA_A2_A HLA_B1_A HLA_B1_B HLA_B2_A HLA_B2_B HLA_B2_C
1 HG00096 GBR 01:01:01:01 01:01:01:02N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04
2 HG00097 GBR 03:01:01:01 03:01:01:02N 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04
3 HG00098 GBR 01:01:01:01 01:01:01:02N 01:22N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04
4 HG00099 GBR 03:01:01:01 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04