I am having troubles converting a complicated data base output from its wide form to the long format. It has a couple of hundred rows and ~1,000 columns. It looks something like this:
The problem is that the variable of df1_long or the columns of df1_wide contain information that I want to select upon. So I rather want to have something like this:
ID part task subgroup type result
Ind_A a 12 aa 2 yes
Ind_A a 12 bb 2 yes
Ind_A b 12 aa 3 opt_1
Ind_A b 13 aa 4 100
Ind_B a 12 aa 2 no
Ind_B a 12 bb 2 yes
Ind_B b 12 aa 3 opt_2
Ind_B b 13 aa 4 50
Ind_C a 12 aa 2 no
Ind_C a 12 bb 2 no
Ind_C b 12 aa 3 opt_1
Ind_C b 13 aa 4 200
I dont mind that the numerical values/results get transformed into characters.
Is there a easy way to accomplish that with the stack() or reshape() function? Or do I have to code a function that performes some strsplits commands on the column names to extract the relevant information and then store those in separate new columns.
The long format will definitely be much more easy to work with and ask questions to the data set. Coding something for >1000 columns with those cryptic columns names just looks like a nightmare to me.
Ok fine. I give in.
Here's your data:
df1_wide <- data.frame(v1 = c("Ind_A", "Ind_B", "Ind_C"),
v2 = c("Y", "N", "N"), v3 = c("Y", "Y", "N"),
v4 = c("op1_1", "opt_2", "opt_1"),
v5 = c(100, 50, 200))
rownames(df1_wide) <- letters[1:3]
colnames(df1_wide) <- c("ID", "a_t12_aa (Type # 2)", "a_t12_bb (Type # 2)",
"b_t12_aa (Type # 3)", "b_t13_aa (Type # 4)")
df1_wide
# ID a_t12_aa (Type # 2) a_t12_bb (Type # 2) b_t12_aa (Type # 3) b_t13_aa (Type # 4)
# a Ind_A Y Y op1_1 100
# b Ind_B N Y opt_2 50
# c Ind_C N N opt_1 200
This is what you have done so far:
df1_long <- melt(df1_wide, id.vars="ID")
This is what it sounds like you want:
cbind(df1_long["ID"],
colsplit(gsub("\\s|\\(Type|\\)|#", " ", df1_long$variable),
pattern="_|\\s+",
names = c("part", "task", "subgroup", "type")),
df1_long["value"])
# ID part task subgroup type value
# 1 Ind_A a t12 aa 2 Y
# 2 Ind_B a t12 aa 2 N
# 3 Ind_C a t12 aa 2 N
# 4 Ind_A a t12 bb 2 Y
# 5 Ind_B a t12 bb 2 Y
# 6 Ind_C a t12 bb 2 N
# 7 Ind_A b t12 aa 3 op1_1
# 8 Ind_B b t12 aa 3 opt_2
# 9 Ind_C b t12 aa 3 opt_1
# 10 Ind_A b t13 aa 4 100
# 11 Ind_B b t13 aa 4 50
# 12 Ind_C b t13 aa 4 200
I can't guarantee that the regular expression that I used with gsub
will work with your actual data, but hopefully this is enough to point you in the right direction.
In the future, please note that sharing a screenshot of your data is pretty useless. Please share data as I did in this question, so that others can easily copy and paste to get started on trying to help you out.
You may want to consider that you rename your variables along the lines of a_t12_aa_2
, in the future, in which case, the colsplit
step would just be colsplit(variable, "_", c("part", "task", "subgroup", "type"))
.