I have a data frame, with each column containing data on group membership. For each column, I would like to calculate the frequency of each group in that column as well as the percentage of each group out of the total frequency for that particular column.
I would like the results (frequency table) to be stored in a list, with each element of the list being the result (frequency by group, percentage by group) for a particular column.
Here is what the data frame look like:
df <- data.frame(gender = c("male", "female", "", "male", "female"),
score1 = c(1, 1, NA, NA, 3),
score2 = c(NA, NA, 3, 4, 5),
score3 = c(2, 2, 3, 3, NA))
I have attempted the following code, which works on only 1 variable/column at a time.
countsbyvars <- function(var) {
df %>%
select({{var}}) %>%
drop_na() %>%
group_by({{var}}) %>%
summarise(n = n()) %>%
mutate(freq = paste0(n / sum(n) * 100, "%"))
}
I would like to loop the column names of the data frame as input into the above user-generated function and store the result into a list.
I have attempted the following 3 pieces of code that I have written, but it does not working
# Using a for loop
res <- list()
for (i in names(df)) {
res[i] <- countsbyvars(i)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `i` is not found.
}
# Using lapply to loop
lapply(df, countsbyvars(names(df)))
Error in `group_by()`:
ℹ In argument: `names(df)`.
Caused by error:
! `names(df)` must be size 0 or 1, not 317.
# Not sure how to use the across() function without hardcoding the column name
The end result would be something like this:
[[1]]
male | 2 | 50%
female | 2 | 50%
[[2]]
1 | 2 | 66.666666%
3 | 1 | 33.333333%
[[3]]
3 | 1 |33.33333%
4 | 1 | 33.33333%
5 | 1 | 33.33333%
[[4]]
2 | 2 | 50%
3 | 2 | 50%
v3
Probably increases performance:
lapply(names(df), \(i) {
a = table(df[[i]]) |> as.data.frame()
names(a)[1L] = i; x = sum(a$Freq)
a$Perc = sprintf("%.2f%%", 100L * a$Freq / x); a })
v2
Create tables with table
and prop.table
, combine to data.frame
, and "setNames
" in one lapply
call:
lapply(seq_along(df), \(i) {
a = table(df[[i]])
b = sprintf("%.2f%%", round(prop.table(a), 4L)*100L)
setNames(data.frame(a, b), c(names(df)[i], "Freq", "Perc")) })
#> [[1]]
#> gender Freq Perc
#> 1 female 2 50.00%
#> 2 male 2 50.00%
#>
#> [[2]]
#> score1 Freq Perc
#> 1 1 2 66.67%
#> 2 3 1 33.33%
#>
#> [[3]]
#> score2 Freq Perc
#> 1 3 1 33.33%
#> 2 4 1 33.33%
#> 3 5 1 33.33%
#>
#> [[4]]
#> score3 Freq Perc
#> 1 2 2 50.00%
#> 2 3 2 50.00%
df <- data.frame(gender = c("male", "female", "", "male", "female"),
score1 = c(1, 1, NA, NA, 3),
score2 = c(NA, NA, 3, 4, 5),
score3 = c(2, 2, 3, 3, NA))
df[df == ""] = NA