Search code examples
rplotstatisticscorrelationchi-squared

Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?


Actually there are 2 questions, one is more advanced than the other.

Q1: I am looking for a method that similar to corrplot() but can deal with factors.

I originally tried to use chisq.test() then calculate the p-value and Cramer's V as correlation, but there too many columns to figure out. So could anyone tell me if there is a quick way to create a "corrplot" that each cell contains the value of Cramer's V, while the colour is rendered by p-value. Or any other kind of similar plot.

Regarding Cramer's V, let's say tbl is a 2-dimensional factor data frame.

chi2 <- chisq.test(tbl, correct=F)
Cramer_V <- sqrt(chi2$/nrow(tbl)) 

I prepared a test data frame with factors:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 

Q2: Then I would like to compute a correlation/association matrix on a mixed-types dataframe e.g.:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 
df$group <- factor(df$group, levels = c('A', 'B', 'C'), ordered = T)
df$student <- as.integer(df$student)

Solution

  • Here's a tidyverse solution:

    # example dataframe
    df <- data.frame(
      group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
      student = c('01', '01', '01', '02', '02', '01', '02'),
      exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
      subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
    ) 
    
    library(tidyverse)
    library(lsr)
    
    # function to get chi square p value and Cramers V
    f = function(x,y) {
        tbl = df %>% select(x,y) %>% table()
        chisq_pval = round(chisq.test(tbl)$p.value, 4)
        cramV = round(cramersV(tbl), 4) 
        data.frame(x, y, chisq_pval, cramV) }
    
    # create unique combinations of column names
    # sorting will help getting a better plot (upper triangular)
    df_comb = data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = F)
    
    # apply function to each variable combination
    df_res = map2_df(df_comb$X1, df_comb$X2, f)
    
    # plot results
    df_res %>%
      ggplot(aes(x,y,fill=chisq_pval))+
      geom_tile()+
      geom_text(aes(x,y,label=cramV))+
      scale_fill_gradient(low="red", high="yellow")+
      theme_classic()
    

    enter image description here

    Note that I'm using lsr package to calculate Cramers V using the cramersV function.