
Heatmap of categorical variable counts


I have a data frame of items, and each has multiple classifier columns that are categorical variables.

ID    test1    test2     test3
1     A        B         A
2     B        A         C
3     C        C         C
4     A        A         B
5     B        B         B
6     B        A         C

I want to generate a heatmap for each pair of test columns (test1 v test2, test1 v test3, etc.) using ggplot2. Each heatmap would have all levels of one test (in this case A, B, C) on the x-axis and all levels of the other test on the y-axis, and each tile should be colored by the number of IDs that have that combination of classifications.

For example, with the input above, in the heatmap of test1 against test2 the tile at the intersection of B for test1 and A for test2 would be brightest, since 2 IDs have that combination. I hope to use these heatmaps to see which tests are most congruent for the data set, but I can't use Pearson's correlation coefficient since the variables are categorical.
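A minimal base R sketch of the counts such a heatmap would show, using the example data above. The names `counts` and `b_a` are illustrative only, and `table()` is just one way to get the cross-tabulation.

```r
# Example data copied from the table above
dat <- data.frame(
  ID    = 1:6,
  test1 = c("A", "B", "C", "A", "B", "B"),
  test2 = c("B", "A", "C", "A", "B", "A"),
  test3 = c("A", "C", "C", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Cross-tabulate two tests: rows are test1 levels, columns are test2 levels,
# and each cell is the number of IDs with that combination
counts <- table(test1 = dat$test1, test2 = dat$test2)
print(counts)

# The cell described in the text: test1 == "B" and test2 == "A"
b_a <- counts["B", "A"]
```

Each cell of `counts` corresponds to one tile of the test1 v test2 heatmap.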

I am familiar with ggplot2, which is why I prefer that package, but if it is easier in pheatmap, I am happy to learn that instead.


Solution

  • It took me some time to work out how to do this, and I am still not sure it is the best way.

    Data:
    dat = structure(list(ID = 1:6, 
                         test1 = c("A", "B", "C", "A", "B", "B"), 
                         test2 = c("B", "A", "C", "A", "B", "A"), 
                         test3 = c("A", "C", "C", "B", "B", "C")
                         ), 
                    .Names = c("ID", "test1", "test2", "test3"), 
                     class = "data.frame", row.names = c(NA, -6L)
                    )
    
    Libraries
    library(tidyverse)
    library(ggthemes)
    library(gridExtra)
    
    Create all pairs of factor levels, and all combinations of tests taken 2 at a time
    fcombs <- expand.grid(LETTERS[1:3], LETTERS[1:3], stringsAsFactors = FALSE)
    tcombs <- as.data.frame(combn(colnames(dat[,-1]), 2), stringsAsFactors = FALSE)
    
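    Loop over the test combinations with lapply: full_join each pair of columns onto the grid of level pairs, then count the IDs in each group, excluding NAs
    dtl <- lapply(tcombs, function(i){
            # all_of() selects the columns named in the character vector i
            select(dat, ID, all_of(i)) %>%
            # fcombs is x here, so level pairs with no matching ID get ID = NA
            full_join(x = fcombs, by = c("Var1" = i[1], Var2 = i[2])) %>%
            group_by(Var1, Var2) %>%
            # N is the number of IDs with this pair of levels (0 if there are none)
            mutate(N = sum(!is.na(ID)), ID = NULL) %>%
            ungroup()
      }
    )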
    
    Create a list of plots
    pl <- lapply(seq_along(tcombs), function(i){
            gtitle = paste(tcombs[[i]], collapse = " ~ ")
            dtl[[i]] %>%
            ggplot(aes(x = Var1, y = Var2, fill = N)) +
            geom_tile() +
            theme_tufte() +
            theme(axis.title = element_blank()) +
            ggtitle(gtitle)
            }
          )
    
    Create a list of tables (tableGrob objects)
    tbl <- lapply(tcombs, function(i) tableGrob(select(dat, ID, all_of(i)),
                                                theme = ttheme_minimal()))
    
    Put everything into the resulting list and plot
    # Interleave the two lists so each heatmap sits next to its table
    resl <- c(pl, tbl)[c(1, 4, 2, 5, 3, 6)]
    
    grid.arrange(grobs = resl, ncol = 2, nrow = 3)
    

    [Figure: the resulting heatmaps]
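As a possible alternative to the full_join step, the same counts (including the zero-count combinations) can be sketched in base R by turning each column into a factor with fixed levels before calling `table()`. This is a sketch under assumptions, not the method used above: the helper `pair_counts`, its `lvls` argument, and the `pair` column are hypothetical names, and the levels are assumed to be A, B, C as in the example data. The plotting part only runs if ggplot2 is installed.

```r
# Example data copied from the question
dat <- data.frame(
  ID    = 1:6,
  test1 = c("A", "B", "C", "A", "B", "B"),
  test2 = c("B", "A", "C", "A", "B", "A"),
  test3 = c("A", "C", "C", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: counts for one pair of columns, in long format.
# Fixing the factor levels keeps combinations that never occur, with N = 0.
pair_counts <- function(df, cols, lvls = LETTERS[1:3]) {
  x <- factor(df[[cols[1]]], levels = lvls)
  y <- factor(df[[cols[2]]], levels = lvls)
  out <- as.data.frame(table(Var1 = x, Var2 = y),
                       responseName = "N", stringsAsFactors = FALSE)
  out$pair <- paste(cols, collapse = " ~ ")
  out
}

# All combinations of tests taken 2 at a time, as a list of character vectors
pairs <- combn(c("test1", "test2", "test3"), 2, simplify = FALSE)

# One long data frame: 9 level pairs for each of the 3 test pairs
all_counts <- do.call(rbind, lapply(pairs, pair_counts, df = dat))

# Optional: one faceted heatmap instead of a list of separate plots
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(all_counts, aes(x = Var1, y = Var2, fill = N)) +
    geom_tile() +
    facet_wrap(~ pair)
  print(p)
}
```

Faceting puts all three heatmaps on a shared fill scale, which may make the pairs easier to compare; the separate plots in the answer each get their own scale.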