I have a dataframe containing over 150,000 entries. Example below:
ID <- 1111, 1222, 3333, 4444, 1555, 6666
V1 <- 1, 0, 1, 0, 0, 0
V2 <- 1, 0, 0, 0, 0, 1
V3 <- 0, 1, 1, 0, 0, 1
V4 <- 1, 0, 1, 1, 0, 0
AgeGr <- 15-24,24-35,15-24,35-48, 48+, 35-48
All the variables (V1-V4 in the example) are dichotomous questions answered in 0/1. Now I would like to summarise all the occurrence of 0/1 of every variable across age groups. I expect output like this:
Variable V1 V2 V3 V4 # Variale names
Answer 0 1 0 1 0 1 0 1 # answer levels (1/0)
15-24 0 2 1 1 1 1 0 2 # the frequency of "0" and "1" under this age group
24-35 1 0 1 0 0 1 1 0
35-48 2 0 1 1 1 1 0 1
48+ 1 0 1 0 1 0 1 0
I have tried janitor::tabyl, using tabyl(df,AgeGr, V1). Yet it only summarises the V1 in one line. When I tried tabyl(df,AgeGr, df[,V1:V4]), it failed. I am wondering if I could go with tabyl() and use functions like apply()? or should I turn to something else?
Any suggestions will be highly appreciated. Thank you in advance:)
You could do the following:
ID <- c(1111, 1222, 3333, 4444, 1555, 6666)
V1 <- c(1, 0, 1, 0, 0, 0)
V2 <- c(1, 0, 0, 0, 0, 1)
V3 <- c(0, 1, 1, 0, 0, 1)
V4 <- c(1, 0, 1, 1, 0, 0)
AgeGr <- c("15-24","24-35","15-24","35-48", "48+", "35-48")
df <- data.frame(ID=ID,V1=V1,V2=V2,V3=V3,V4=V4,AgeGr = AgeGr, stringsAsFactors = FALSE)
ageAnswerSplit <- split(df[,c("V1","V2","V3","V4")],df[["AgeGr"]])
summarized <- do.call("rbind",lapply(ageAnswerSplit, function(answerdf) {
answertables <- lapply(names(answerdf), function(nam) {
at <- table(answerdf[[nam]])
setNames(data.frame(unname(at["0"]),unname(at["1"])),paste0(nam,":",c(0,1)))
})
do.call("cbind",answertables)
}))
summarized[is.na(summarized)] <- 0
resulting in
> summarized
V1:0 V1:1 V2:0 V2:1 V3:0 V3:1 V4:0 V4:1
15-24 0 2 1 1 1 1 0 2
24-35 1 0 1 0 0 1 1 0
35-48 2 0 1 1 1 1 1 1
48+ 1 0 1 0 1 0 1 0