Search code examples
rggplot2plotlyr-plotlyggplotly

5x5 Bubble Chart in R


I am trying to reproduce a graphic similar to what is used on the California Dashboard for School Accountability in R Shiny. For sake of this example, consider the following data frame:

student <- c("1234", "4321", "5678", "8765")
schools <- c("ABC", "ABC", "XYZ", "XYZ")
DFS_20 <- c(-34.2, -1.5, 2.8, 8.9)
DFS_21 <- c(-13.5, 27.8, 5.4, 3.9)
school_data <- data.frame("student_id" = student, "school_id" = schools, "DFS_2020" = DFS_20, "DFS_2021" = DFS_21, "Delta_DFS" = DFS_21 - DFS_20)

I would like to somehow plot this data on a grid like this: 5x5 matrix

where you would have a data point in at [x, y] = [4, 1] (with lower left being (0,0) representing student 1234 because their DFS_2021 score is LOW (-13.5) but their year-over-year growth increased significantly (20.7); a point in [x, y] = [4, 3] for student 4321 because their DFS_2021 score is HIGH (27.8) and their year-over-year growth increased significantly (29.3) etc. I want a Bubble chart so that the point increases in size relative to the number of data points within each cell, but I have no idea where to start creating the canvas (with colors) to overlay the data points onto. I know I can translate their scores into lattice points to plot on a 5x5 grid, but to make the grid with the colors is beyond my skillset.


Solution

  • In short you can recode the valus into factors and count each combination in your dataset. With this new table (containing current DFS level, DFS difference level and number of students in each category) you can easily create a point-plot.

    To color code your points you may need an extra column in your table with the color. Therefore I created a meta color table (all DFS combinations and the associated color) and joined the tables.

    Code

    # load packages
    library(tidyverse)
    
    # create color table
    df_col <- crossing(DFS_current_status = factor(c("very low", "low", "medium",
                                                     "high", "very high"),
                                                   levels = c("very high", "high", 
                                                                  "medium", "low", "very low")), 
                       DFS_diff = factor(c("declined significantly", "declined",
                                    "maintained" ,"increased", 
                                    "increased significantly"), 
                                    levels = c("declined significantly",
                                               "declined", "maintained",
                                               "increased", "increased significantly"))) %>%
      add_column(color = c("green", "green", "blue", "blue","blue",
                           "green", "green", "green", "green", "blue",
                           "yellow", "yellow", "yellow", "green", "green", 
                           "orange", "orange", "orange", "yellow", "yellow", 
                           "red", "red", "red", "orange", "orange"))
    
    
    # transform data
    df <- school_data %>%  
      mutate(DFS_current_status = case_when(DFS_2021 >=  45 ~ "very high",
                                            between(DFS_2021, 10, 44.9) ~ "high",
                                            between(DFS_2021, -5, 9.9) ~ "medium",
                                            between(DFS_2021, -70, -5.1) ~ "low",
                                            DFS_2021 < -70 ~ "very low",
                                            TRUE ~ NA_character_),
             DFS_diff = case_when(Delta_DFS < -15 ~ "declined significantly",
                                  between(Delta_DFS, -15, -3) ~ "declined",
                                  between(Delta_DFS, -2.9, 2.9) ~ "maintained",
                                  between(Delta_DFS, 3, 14.9) ~ "increased",
                                  Delta_DFS >= 15 ~ "increased significantly",
                                  TRUE ~ NA_character_)) %>%
      count(DFS_current_status, DFS_diff) %>%
      left_join(df_col) %>%
      mutate(DFS_current_status = factor(DFS_current_status,
                                         levels = rev(c("very high", "high", 
                                                    "medium", "low", "very low"))),
             DFS_diff = factor(DFS_diff, 
                               levels = c("declined significantly",
                                          "declined", "maintained",
                                          "increased", "increased significantly")))
    
    
    # create plot
    p <- ggplot(df) +
      geom_point(aes(x = DFS_diff,
                     y = DFS_current_status, 
                     size = n,
                     color = color)) +
      scale_y_discrete(drop = F) +
      scale_x_discrete(drop = F, position = "top") +
      scale_color_identity()
    
    # display plot in plotly
    ggplotly(p) %>% 
      layout(xaxis = list(side ="top")) 
    

    Plot enter image description here

    Edit: Comment - color the grid instead of point

    df_col %>% 
      ggplot() +
      geom_raster(aes(x = DFS_diff, 
                      y = rev(DFS_current_status),
                      fill= color)) +
      scale_fill_identity() +
      scale_x_discrete(position = "top") +
      geom_point(data = df, aes(x = DFS_diff,
                     y = DFS_current_status,
                     size = n))
    

    enter image description here