(I do understand that my question is equivalent to this one : R function equivalent to proc summary in SAS But being a new user, I can't comment on the solutions to ask details or explanations and I can't get any of them to work.)
I'm trying to convert a script from SAS to R. The objective is to get a wide summary of a database across multiple variables.
The starting base is like this :
Student ID | Flag1 | Flag2 | Flag3 | other flags... | weight | score |
---|---|---|---|---|---|---|
code1 | level1 | A | first | smth~~ | 2 | 12 |
code23 | level5 | C | third | smth~else~ | 3 | 9 |
And in the end I want something like this :
Flag1 | Flag2 | Flag3 | other flags... | nb of students | weighted mean | std dev | min | 1st quartile | ... | max | nb of students in fist decile | ... | nb of students in last decile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
level1 | A | first | smth~~ | 5 | 10.96 | 1.5 | 1 | ... | ... | ... | ... | ... | ... |
level5 | .All | third | smth~else~ | 1500 | 8.70 | 2.7 | 3 | ... | ... | ... | ... | ... | ... |
In SAS it was really easy because proc summary does the summary for each combination of grouping possible, but in R, you only get the lowest level of grouping. With 9 different levels of grouping that's 512 combinations and I think there should be a way to loop some of the work.
Here's how I think I should proceed :
1- List all the different combinations in a dataframe :
Flag1 | Flag2 | Flag3 |
---|---|---|
.All | .All | .All |
.All | .All | first |
.All | .All | second |
.All | A | .All |
.All | B | .All |
LV1 | .All | .All |
LV2 | .All | .All |
.All | A | first |
.All | A | second |
.All | B | first |
.All | B | second |
LV1 | .All | first |
LV1 | .All | second |
LV2 | .All | first |
LV2 | .All | second |
LV1 | A | .All |
LV1 | B | .All |
LV2 | A | .All |
LV2 | B | .All |
LV1 | A | first |
LV1 | A | second |
LV1 | B | first |
LV1 | B | second |
LV2 | A | first |
LV2 | A | second |
LV2 | B | first |
LV2 | B | second |
2- Make a 2^n length loop that will call the following function :
3- The function would take a line from the last dataframe and then output a dataframe that would contain the summary grouping by some variables + columns with .All for the variables not used for grouping
4- stack each iteration of the loop on each other using bind_rows
I encountered multiple hurdles solving this problem but I ended with a satisfying solution :
#import the data
testbase <- read_excel("testbase.xlsx")
#list all the grouping variables
variables = c(quo(Flag1), quo(Flag2),quo(Flag3))
#create the powerset of the list of variables
listevars=powerSet(variables,length(variables),rev=FALSE)
for (i in 1:length(listevars)){
testbase=ungroup(testbase)
if (length(listevars[[i]])!=0){
testbase=group_by(testbase,!!!listevars[[i]])
}
resumepartiel=summarize(testbase,weighted.mean(score,weight))
varexcl=variables[!(variables %in% listevars[[i]])]
if (length(varexcl)!=0){
for(j in 1:length(varexcl)){
colonne=data.frame(c(rep(".All",times = nrow(resumepartiel))))
colonne=setNames(colonne,as_name(varexcl[[j]]))
resumepartiel=bind_cols(colonne,resumepartiel)
}
}
if(i==1){
resume=resumepartiel
}
else{
resume=bind_rows(resume,resumepartiel)
}
}
this code will output what I want for three variables and only the weighted mean but adding more variables or more summary functions is trivial.