I have a fairly large data.frame that I need to wrangle a bit. The current structure is:
V1 V2 V3 V4 V5 V6 V7 V8 ... Vn Vn+1
chr1 1 A T sample_1 value_1 sample_2 value_4 ... sample_n value_7
chr1 40 T C sample_1 value_2 sample_2 value_5 ... sample_n value_8
chr1 60 A T sample_1 value_3 sample_2 value_6 ... sample_n value_9
.
.
.
chrX 160 A T sample_1 value_x sample_2 value_y ... sample_n value_ni
For example, for this data frame:
df <- structure(list(V1 = c(10L, 10L, 10L, 10L, 10L, 10L), V2 = c(3387501L,
4174142L, 6419754L, 6419765L, 6419897L, 6419912L), V3 = c("T",
"A", "C", "T", "G", "A"), V4 = c("A",
"T", "A", "A", "C", "G"), V5 = c("LP2000748-DNA_H02",
"LP2000748-DNA_H02", "LP2000748-DNA_H02", "LP2000748-DNA_H02",
"LP2000748-DNA_H02", "LP2000748-DNA_H02"), V6 = c("0/0", "0/0",
"1/1", "0/0", "0/0", "0/0"), V7 = c("LP2000748-DNA_A03", "LP2000748-DNA_A03",
"LP2000748-DNA_A03", "LP2000748-DNA_A03", "LP2000748-DNA_A03",
"LP2000748-DNA_A03"), V8 = c("0/0", "0/0", "1/1", "0/1", "0/0",
"0/0"), V9 = c("LP2000795-DNA_B01", "LP2000795-DNA_B01", "LP2000795-DNA_B01",
"LP2000795-DNA_B01", "LP2000795-DNA_B01", "LP2000795-DNA_B01"
), V10 = c("0/0", "0/0", "1/1", "0/0", "0/0", "0/0")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
What I'd like to have in the end is a table like this:
V1 V2 V3 V4 sample_1 sample_2 ... sample_n
chr1 1 A T value_1 value_4 ... value_7
chr1 40 T C value_2 value_5 ... value_8
chr1 60 A T value_3 value_6 ... value_9
.
.
.
chrX 160 A T value_x value_y ... value_ni
What I've tried so far in R is:
samples_data <- seq(from = 5, to = dim(df)[2], by = 2)
variable_data <- samples_data + 1
new_df <- reshape2::dcast(df, V1 + V2 + V3 ~ colnames(df)[samples_data], value.var= colnames(df)[variable_data])
but I get this error message:
recursive indexing failed at level 2
In addition: Warning message:
In if (!(value.var %in% names(data))) { :
  the condition has length > 1 and only the first element will be used
Does anyone have any suggestion on how to tackle this problem or on how to reshape the df?
Thanks!
You probably need to un-nest the data first, then use reshape. To un-nest, you could use Map to generate a list of data frames, each holding the first four ID columns plus one sample/value pair from the remaining columns (5,6; 7,8; 9,10). Then rbind the result and reshape it to wide format.
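# Columns 5..n alternate between a sample-name column and its value column
cseq <- 5:ncol(df)
# Stack one data frame per (sample, value) pair, keeping the four ID columns
tmp <- do.call(rbind, Map(function(x, y) setNames(df[c(1:4, x:y)],
                                                  c(names(df)[1:4], c("sample", "value"))),
                          cseq[cseq %% 2 != 0], cseq[cseq %% 2 == 0]))
# Spread the stacked data back out: one value column per sample
res <- reshape(tmp, idvar=1:4, timevar="sample", v.names="value", direction="wide")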
res
#   V1      V2 V3 V4 value.LP2000748-DNA_H02 value.LP2000748-DNA_A03 value.LP2000795-DNA_B01
# 1 10 3387501  T  A                     0/0                     0/0                     0/0
# 2 10 4174142  A  T                     0/0                     0/0                     0/0
# 3 10 6419754  C  A                     1/1                     1/1                     1/1
# 4 10 6419765  T  A                     0/0                     0/1                     0/0
# 5 10 6419897  G  C                     0/0                     0/0                     0/0
# 6 10 6419912  A  G                     0/0                     0/0                     0/0
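If you want the columns named after the samples alone, as in the desired output in the question, you can strip the "value." prefix that reshape adds. Below is a minimal, self-contained sketch of the whole approach on a toy data frame. The sample names s1 and s2 and the genotype values are made up for illustration, and lapply over the odd column indices is used here as an equivalent of the Map call above.

```r
# Toy input in the same layout: 4 ID columns, then sample/value pairs.
# Sample names "s1"/"s2" and the values are hypothetical.
df <- data.frame(
  V1 = c(10L, 10L), V2 = c(100L, 200L),
  V3 = c("T", "A"), V4 = c("A", "T"),
  V5 = "s1", V6 = c("0/0", "1/1"),
  V7 = "s2", V8 = c("0/1", "0/0")
)

# Positions of the sample-name columns (5, 7, ...); each value column follows
sample_cols <- seq(5, ncol(df), by = 2)

# Stack one block per sample: ID columns + "sample" + "value"
tmp <- do.call(rbind, lapply(sample_cols, function(i) {
  setNames(df[c(1:4, i, i + 1)], c(names(df)[1:4], "sample", "value"))
}))

# Long to wide: one value column per distinct sample name
res <- reshape(tmp, idvar = names(df)[1:4], timevar = "sample",
               v.names = "value", direction = "wide")

# Drop the "value." prefix so columns are named by sample only
names(res) <- sub("^value\\.", "", names(res))
res
```

Note that the example object in the question carries the data.table class. If the data.table package is loaded, df[c(1:4, i, i + 1)] selects rows rather than columns, so it is probably safest to convert with as.data.frame(df) first.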