Search code examples
rsummarysummarize

Get a summary for all combinations of grouping like proc summary in SAS


(I do understand that my question is equivalent to this one : R function equivalent to proc summary in SAS But being a new user, I can't comment on the solutions to ask details or explanations and I can't get any of them to work.)

I'm trying to convert a script from SAS to R. The objective is to get a wide summary of a database across multiple variables.

The starting base is like this :

Student ID Flag1 Flag2 Flag3 other flags... weight score
code1 level1 A first smth~~ 2 12
code23 level5 C third smth~else~ 3 9

And in the end I want something like this :

Flag1 Flag2 Flag3 other flags... nb of students weighted mean std dev min 1st quartile ... max nb of students in fist decile ... nb of students in last decile
level1 A first smth~~ 5 10.96 1.5 1 ... ... ... ... ... ...
level5 .All third smth~else~ 1500 8.70 2.7 3 ... ... ... ... ... ...

In SAS it was really easy because proc summary does the summary for each combination of grouping possible, but in R, you only get the lowest level of grouping. With 9 different levels of grouping that's 512 combinations and I think there should be a way to loop some of the work.

Here's how I think I should proceed :

1- List all the different combinations in a dataframe :

Flag1 Flag2 Flag3
.All .All .All
.All .All first
.All .All second
.All A .All
.All B .All
LV1 .All .All
LV2 .All .All
.All A first
.All A second
.All B first
.All B second
LV1 .All first
LV1 .All second
LV2 .All first
LV2 .All second
LV1 A .All
LV1 B .All
LV2 A .All
LV2 B .All
LV1 A first
LV1 A second
LV1 B first
LV1 B second
LV2 A first
LV2 A second
LV2 B first
LV2 B second

2- Make a 2^n length loop that will call the following function :

3- The function would take a line from the last dataframe and then output a dataframe that would contain the summary grouping by some variables + columns with .All for the variables not used for grouping

4- stack each iteration of the loop on each other using bind_rows


Solution

  • I encountered multiple hurdles solving this problem but I ended with a satisfying solution :

    #import the data
    testbase <- read_excel("testbase.xlsx")
    #list all the grouping variables
    variables = c(quo(Flag1), quo(Flag2),quo(Flag3))
    #create the powerset of the list of variables
    listevars=powerSet(variables,length(variables),rev=FALSE)
    
    for (i in 1:length(listevars)){
      testbase=ungroup(testbase)
      if (length(listevars[[i]])!=0){
        testbase=group_by(testbase,!!!listevars[[i]])
      }
      resumepartiel=summarize(testbase,weighted.mean(score,weight))
      varexcl=variables[!(variables %in% listevars[[i]])]
      if (length(varexcl)!=0){
        for(j in 1:length(varexcl)){
          colonne=data.frame(c(rep(".All",times = nrow(resumepartiel))))
          colonne=setNames(colonne,as_name(varexcl[[j]]))
          resumepartiel=bind_cols(colonne,resumepartiel)
        }
      }
      if(i==1){
        resume=resumepartiel
      }
      else{
        resume=bind_rows(resume,resumepartiel)
      }
    }
    

    this code will output what I want for three variables and only the weighted mean but adding more variables or more summary functions is trivial.