Search code examples
rclassdplyrlabelsummarize

Creating summary statistics (summarise_all) for a large factor dataset, retaining factor info


I have a large dataset with observational survey data which I would like to aggregate to country-year level (also for factors), in order to use the data as country-level data in another dataset. One df that I would like to aggregate has the following classes:

character  labelled   numeric 
       24       272        50

Where I am pretty sure the labelled class is the result of the Hmisc library.

I started out as follows, which worked quite well.

dfsum <- df %>%
 group_by(countryyear) %>%
 summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first (.)))

Surprisingly this leaves me with 244/346 variables (I have no clue why it would be that number, any explanation would be great).

I would like to include as many columns as possible in the dfsum. I realise that for un-ordered factors that would not provide any useful info, but it will for the ordered factors. For binary variables the value between 0 and 1 would for example give me the size of each category and the ordinal variables are often scales. I tried to do:

dfsum <- df%>%
 group_by(countryyear) %>%
 summarise_all(funs(if(is.numeric(.)|is.factor(.)) mean(., na.rm = TRUE) else first (.)))

But that did not really do anything (not add any extra variables).

More importantly I would like in the summarization process like to retain the factor information. Is it possible to somehow reattach that information in a different way? For example that it was a binary value (perhaps if more than 50% of the original variable was either 0 or 1), or add the scale (by taking the min and the max of the original variable)?


Solution

  • By combining a lot of other answers, please see the appropriate links, I managed to deal with my problem as follows:

    #1
    as.numeric.factor <- function(x) {as.numeric(as.character(x))}
    #2
    df[] = lapply(df, as.numeric.factor)
    #3
    cols = sapply(df, is.numeric)
    cols = names(cols)[cols]
    #4
    dfsummary = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=countryyear]
    

    1, 2, 3, 4