I have a large dataset with observational survey data which I would like to aggregate to country-year level (also for factors), in order to use the data as country-level data in another dataset. One df
that I would like to aggregate has the following classes:
character labelled numeric
24 272 50
Where I am pretty sure the labelled class is the result of the Hmisc
library.
I started out as follows, which worked quite well.
dfsum <- df %>%
group_by(countryyear) %>%
summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first (.)))
Surprisingly this leaves me with 244/346 variables (I have no clue why it would be that number, any explanation would be great).
I would like to include as many columns as possible in the dfsum
. I realise that for un-ordered factors that would not provide any useful info, but it will for the ordered factors. For binary variables the value between 0 and 1 would for example give me the size of each category and the ordinal variables are often scales. I tried to do:
dfsum <- df%>%
group_by(countryyear) %>%
summarise_all(funs(if(is.numeric(.)|is.factor(.)) mean(., na.rm = TRUE) else first (.)))
But that did not really do anything (not add any extra variables).
More importantly I would like in the summarization process like to retain the factor information. Is it possible to somehow reattach that information in a different way? For example that it was a binary value (perhaps if more than 50% of the original variable was either 0 or 1), or add the scale (by taking the min and the max of the original variable)?
By combining a lot of other answers, please see the appropriate links, I managed to deal with my problem as follows:
#1
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
#2
df[] = lapply(df, as.numeric.factor)
#3
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
#4
dfsummary = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=countryyear]