Search code examples
rfrequency

Table of categorical variables by a grouping variable in R


I have a dataset with some categorical variables + a "cluster" variable. For example:

time <- c("Morning", "Evening" ,"Morning", "Morning", "Afternoon", "Evening", "Afternoon")
dollar <- c("1-5", "6-10", "11-15", "1-5", "1-5", "6-10", "6-10")
with_kids <- c("no", "yes", "yes", "no", "no", "yes", "yes")
cluster <- c(1,1,2,3,2,2,3)

data <- cbind(time, dollar, with_kids, cluster)

How can I create a frequency table of all the categorical variables by "cluster"?

Desired Output -> table on the right (i.e. % within each cluster, for each categorical variable

Desired output is the table on the right (column % of each categorical variable within each cluster).

I know this code will work for one variable. What is the most efficient way to do it if I have many more categorical variables?

table(data$time, data$cluster)

Solution

  • I'm not entirely sure of your desired output, but here are two possibilities.

    A list of tables:

    myList <- lapply(dat[head(names(dat), -1)], table, dat$cluster)
    myList
    $time
    
                1 2 3
      Afternoon 0 1 1
      Evening   1 1 0
      Morning   1 1 1
    
    $dollar
    
            1 2 3
      1-5   1 1 1
      11-15 0 1 0
      6-10  1 1 1
    
    $with_kids
    
          1 2 3
      no  1 1 1
      yes 1 2 1
    

    To get a list of proportion tables, you can lapply your list of tables using prop.table as the function and feed it margin=2:

    lapply(myList, prop.table, margin=2)
    $time
    
                        1         2         3
      Afternoon 0.0000000 0.3333333 0.5000000
      Evening   0.5000000 0.3333333 0.0000000
      Morning   0.5000000 0.3333333 0.5000000
    
    $dollar
    
                    1         2         3
      1-5   0.5000000 0.3333333 0.5000000
      11-15 0.0000000 0.3333333 0.0000000
      6-10  0.5000000 0.3333333 0.5000000
    
    $with_kids
    
                  1         2         3
      no  0.5000000 0.3333333 0.5000000
      yes 0.5000000 0.6666667 0.5000000
    

    to rbind them together

    do.call(rbind, lapply(dat[head(names(dat), -1)], table, dat$cluster))
              1 2 3
    Afternoon 0 1 1
    Evening   1 1 0
    Morning   1 1 1
    1-5       1 1 1
    11-15     0 1 0
    6-10      1 1 1
    no        1 1 1
    yes       1 2 1
    

    data

    dat <- 
    structure(list(time = structure(c(3L, 2L, 3L, 3L, 1L, 2L, 1L), .Label = c("Afternoon", 
    "Evening", "Morning"), class = "factor"), dollar = structure(c(1L, 
    3L, 2L, 1L, 1L, 3L, 3L), .Label = c("1-5", "11-15", "6-10"), class = "factor"), 
        with_kids = structure(c(1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("no", 
        "yes"), class = "factor"), cluster = c(1, 1, 2, 3, 2, 2, 
        3)), .Names = c("time", "dollar", "with_kids", "cluster"), row.names = c(NA, 
    -7L), class = "data.frame")