Search code examples
rdataframelistdplyr

Creating a frequency table for each column in a data frame using a loop and storing the results in a list


I have a data frame, with each column containing data on group membership. For each column, I would like to calculate the frequency of each group in that column as well as the percentage of each group out of the total frequency for that particular column.

I would like the results (frequency table) to be stored in a list, with each element of the list being the result (frequency by group, percentage by group) for a particular column.

Here is what the data frame look like:

df <- data.frame(gender = c("male", "female", "", "male", "female"),
                 score1 = c(1, 1, NA, NA, 3),
                 score2 = c(NA, NA, 3, 4, 5),
                 score3 = c(2, 2, 3, 3, NA))

I have attempted the following code, which works on only 1 variable/column at a time.

countsbyvars <- function(var) {
  df %>%
    select({{var}}) %>%
    drop_na() %>%
    group_by({{var}}) %>%
    summarise(n = n()) %>%
    mutate(freq = paste0(n / sum(n) * 100, "%"))
}

I would like to loop the column names of the data frame as input into the above user-generated function and store the result into a list.

I have attempted the following 3 pieces of code that I have written, but it does not working

# Using a for loop
res <- list()

for (i in names(df)) {
  res[i] <- countsbyvars(i)

Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `i` is not found.
}

# Using lapply to loop
lapply(df, countsbyvars(names(df)))

Error in `group_by()`:
ℹ In argument: `names(df)`.
Caused by error:
! `names(df)` must be size 0 or 1, not 317.

# Not sure how to use the across() function without hardcoding the column name

The end result would be something like this:

[[1]]

male | 2 | 50%
female | 2 | 50%

[[2]]

1 | 2 | 66.666666%
3 | 1 | 33.333333%

[[3]]

3 | 1 |33.33333%
4 | 1 | 33.33333%
5 | 1 | 33.33333%

[[4]]

2 | 2 | 50%
3 | 2 | 50%

Solution

  • v3

    Probably increases performance:

    lapply(names(df), \(i) {
      a = table(df[[i]]) |> as.data.frame()
      names(a)[1L] = i; x = sum(a$Freq) 
      a$Perc = sprintf("%.2f%%", 100L * a$Freq / x); a }) 
    

    v2

    Create tables with table and prop.table, combine to data.frame, and "setNames" in one lapply call:

    lapply(seq_along(df), \(i) {
      a = table(df[[i]])
      b = sprintf("%.2f%%", round(prop.table(a), 4L)*100L)
      setNames(data.frame(a, b), c(names(df)[i], "Freq", "Perc")) })
    
    #> [[1]]
    #>   gender Freq   Perc
    #> 1 female    2 50.00%
    #> 2   male    2 50.00%
    #> 
    #> [[2]]
    #>   score1 Freq   Perc
    #> 1      1    2 66.67%
    #> 2      3    1 33.33%
    #> 
    #> [[3]]
    #>   score2 Freq   Perc
    #> 1      3    1 33.33%
    #> 2      4    1 33.33%
    #> 3      5    1 33.33%
    #> 
    #> [[4]]
    #>   score3 Freq   Perc
    #> 1      2    2 50.00%
    #> 2      3    2 50.00%
    

    Data

    df <- data.frame(gender = c("male", "female", "", "male", "female"),
                     score1 = c(1, 1, NA, NA, 3),
                     score2 = c(NA, NA, 3, 4, 5),
                     score3 = c(2, 2, 3, 3, NA))
    df[df == ""] = NA