
Hexbin with multiple groups in one plot


I have a data frame with millions of observations and therefore need to display the results as a hexbin chart using geom_hex() or a variant.

My plot is already faceted by outcome ~ predictor, so I would like the factor variable to be mapped to the colour/fill aesthetic of the hexagons and alpha to be mapped to the count (frequency) of observations.

In the code below I show what it looks like with a scatter plot, but I fail to make it work with hexbin. I realise that this means each bin location can hold up to length(factor) hexagons. I don't mind if they overlap.

library(tidyverse)
set.seed(123) 
factor <- c(0.7, 0.8, 0.9, 1)
predictor_name <- paste0("predictor_", LETTERS[1:5])
outcome_name <- paste0("outcome_", LETTERS[1:5])

df <- expand.grid(
  factor = factor,
  predictor_name = predictor_name,
  outcome_name = outcome_name
) %>% 
  rowwise() %>%
  mutate(
    predictor_value = list(runif(100)),
    outcome_value = list(runif(100))
  ) %>%
  ungroup() %>% 
  unnest(c(predictor_value, outcome_value))

#scatter plot coloured by factor
df %>%
  ggplot() +
  geom_point(aes(x = predictor_value , y = outcome_value, col = factor),
             alpha = 0.2) +
  facet_grid(outcome_name ~ predictor_name, scales = "free") +
  scale_color_viridis_b() +
  theme_classic() +
  theme(legend.position = "bottom")


# hex bin
df %>%
  ggplot() +
  geom_hex(aes(x = predictor_value,
               y = outcome_value)) + # fill = factor, alpha = count
  facet_grid(outcome_name ~ predictor_name, scales = "free") +
  theme_classic() +
  theme(legend.position = "bottom")

Created on 2023-09-11 with reprex v2.0.2


Solution

  • If you want each level of factor to have its own layer of hexagons, overlaid on one another with transparency determined by the counts, map fill to factor(factor) and alpha to after_stat(count). The effect is better with a smaller number of bins, though the final plot is still relatively difficult to interpret:

    df %>%
      ggplot() +
      geom_hex(aes(x = predictor_value , y = outcome_value, fill = factor(factor),
                   alpha = after_stat(count)), bins = 10) +
      facet_grid(outcome_name ~ predictor_name, scales = "free") +
      theme_classic() +
      theme(legend.position = "bottom")
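
To see what this mapping actually computes, the layer data can be inspected directly. The following is a minimal sketch on made-up data (the `toy` data frame and its columns `x`, `y`, `g` are hypothetical, not part of the question). It assumes ggplot2 3.3.0 or later for after_stat() and that the hexbin package is installed, since geom_hex() depends on it. Because fill is discrete, the data are binned once per group, so the same hexagon position can appear once for each level:

```r
library(ggplot2)  # geom_hex() additionally needs the hexbin package installed

set.seed(1)
# Hypothetical toy data: two groups of 1000 points each
toy <- data.frame(
  x = runif(2000),
  y = runif(2000),
  g = rep(c("a", "b"), each = 1000)
)

p <- ggplot(toy) +
  geom_hex(aes(x, y, fill = g, alpha = after_stat(count)), bins = 5)

# One row per (hexagon, group) combination, with the computed count
cells <- layer_data(p)

# Number of hexagons drawn for each group
hexes_per_group <- table(cells$group)

# Total count per group, which should account for every observation
count_per_group <- tapply(cells$count, cells$group, sum)
```

If low-count hexagons come out too faint, adding something like scale_alpha_continuous(range = c(0.2, 1)) raises the minimum opacity; the exact range is a matter of taste.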
    


  • If factor is to be treated as a numeric variable, and you want the average value of factor within each bin to determine the fill colour, then you need to hexbin the data manually:

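    # Bin each facet (outcome x predictor combination) separately, then stack
    cell_df <- do.call('rbind',
            split(df, interaction(df$outcome_name, df$predictor_name)) |>
      lapply(function(d) {
        # IDs = TRUE stores each observation's cell ID in hb@cID
        hb <- hexbin::hexbin(d$predictor_value, d$outcome_value,
                             xbins = 10, IDs = TRUE)
        xy <- hexbin::hcell2xy(hb)  # centre coordinates of the occupied cells
        # aggregate() returns the cell ID as Group.1 and the mean of factor as x,
        # sorted by cell ID, the same order as hb@count and xy
        cbind(aggregate(d$factor, by = list(hb@cID), FUN = mean),
              count = hb@count, X = xy$x, Y = xy$y,
              outcome_name = d$outcome_name[1],
              predictor_name = d$predictor_name[1])
      }))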
    
    cell_df %>%
      ggplot() +
      geom_hex(aes(x = X , y = Y, fill = x, alpha = count),
               stat = 'identity') +
      facet_grid(outcome_name ~ predictor_name, scales = "free") +
      theme_classic() +
      scale_fill_viridis_c() +
      theme(legend.position = "bottom")
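
The manual binning above binds the output of aggregate() to hb@count and hcell2xy(hb) column-wise, so it only works if all three list the occupied cells in the same order. A small sanity check on made-up data may be reassuring here. The vectors `x`, `y` and `z` are hypothetical stand-ins for predictor_value, outcome_value and factor, and the check assumes the hexbin package is installed:

```r
library(hexbin)

set.seed(1)
# Hypothetical stand-ins for predictor_value, outcome_value and factor
x <- runif(500)
y <- runif(500)
z <- sample(c(0.7, 0.8, 0.9, 1), 500, replace = TRUE)

hb  <- hexbin(x, y, xbins = 10, IDs = TRUE)
agg <- aggregate(z, by = list(cell = hb@cID), FUN = mean)

# aggregate() sorts its groups by cell ID; compare with the order in hb@cell
same_cells <- identical(as.integer(agg$cell), as.integer(hb@cell))

# Observations per cell recomputed from the IDs, compared with hb@count
same_counts <- all(as.vector(table(hb@cID)) == hb@count)

stopifnot(same_cells, same_counts)
```

If either check failed, the safer route would be to merge on the cell ID explicitly instead of relying on row order.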
    
