I have a dataframe with millions of observations and thus need to display the results as a hexbin chart using geom_hex() or a variant.
My plot is already faceted by outcome~predictor, so I would like the factor variable to be mapped to the color/fill aesthetic of the hex and the alpha to be mapped to the count/frequency of the observations.
In the code below I show what it looks like with a scatter plot, but I fail to make it work with hexbin. I realise that this means that for each bin there will be length(factor) hexbins. I don't mind if they overlap.
library(tidyverse)
set.seed(123)
factor <- c(0.7, 0.8, 0.9, 1)
predictor_name <- paste0("predictor_", LETTERS[1:5])
outcome_name <- paste0("outcome_", LETTERS[1:5])
df <- expand.grid(
  factor = factor,
  predictor_name = predictor_name,
  outcome_name = outcome_name
) %>%
  rowwise() %>%
  mutate(
    predictor_value = list(runif(100)),
    outcome_value = list(runif(100))
  ) %>%
  ungroup() %>%
  unnest(c(predictor_value, outcome_value))
# scatter plot coloured by factor
df %>%
  ggplot() +
  geom_point(aes(x = predictor_value, y = outcome_value, col = factor),
             alpha = 0.2) +
  facet_grid(outcome_name ~ predictor_name, scales = "free") +
  scale_color_viridis_b() +
  theme_classic() +
  theme(legend.position = "bottom")
# hex bin
df %>%
  ggplot() +
  geom_hex(aes(x = predictor_value,
               y = outcome_value)) + # fill = factor, alpha = count
  facet_grid(outcome_name ~ predictor_name, scales = "free") +
  theme_classic() +
  theme(legend.position = "bottom")
Created on 2023-09-11 with reprex v2.0.2
If you want each level of factor to have its own layer, and simply overlay the layers on each other with an alpha that depends on the counts, then you can map the fill color to the factor variable (converted to a discrete variable with factor()) and the alpha to after_stat(count). The effect is better with a smaller number of bins in this case, though the final plot is relatively difficult to interpret:
df %>%
  ggplot() +
  geom_hex(aes(x = predictor_value, y = outcome_value, fill = factor(factor),
               alpha = after_stat(count)), bins = 10) +
  facet_grid(outcome_name ~ predictor_name, scales = "free") +
  theme_classic() +
  theme(legend.position = "bottom")
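To see what after_stat(count) refers to, it can help to look at the data the stat computes. The following is a minimal sketch on hypothetical toy data (the names toy, g, and hex_cells are made up for illustration and are not part of the question's df). It assumes ggplot2 and hexbin are installed and uses layer_data() to pull out the binned values:

```r
library(ggplot2)  # geom_hex() also needs the hexbin package installed

set.seed(1)
# Hypothetical toy data: two groups of 500 points each
toy <- data.frame(
  x = runif(1000),
  y = runif(1000),
  g = rep(c("a", "b"), each = 500)
)

p <- ggplot(toy) +
  geom_hex(aes(x = x, y = y, fill = g, alpha = after_stat(count)), bins = 10)

# layer_data() returns the layer's data after the stat has run.
# Because fill is discrete, each group is binned separately, so the same
# hexagon position can appear once per group, each with its own count.
hex_cells <- layer_data(p)
head(hex_cells[, c("x", "y", "count", "group")])
```

Since every group gets its own set of hexagons, the overlapping cells are drawn on top of each other, which is why the combined plot can be hard to read.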
If factor is to be treated as a numeric variable, and you want the average value of factor within each bin to determine the fill color, then you need to manually hexbin the data:
cell_df <- do.call('rbind',
  split(df, interaction(df$outcome_name, df$predictor_name)) |>
    lapply(function(d) {
      # bin one facet at a time; IDs = TRUE stores each point's cell in hb@cID
      hb <- hexbin::hexbin(d$predictor_value, d$outcome_value,
                           xbins = 10, IDs = TRUE)
      # aggregate() puts the per-cell mean of factor in a column named x
      cbind(aggregate(d$factor, by = list(hb@cID), FUN = mean),
            count = hb@count, X = hexbin::hcell2xy(hb)$x,
            Y = hexbin::hcell2xy(hb)$y, outcome_name = d$outcome_name[1],
            predictor_name = d$predictor_name[1])
    }))
cell_df %>%
  ggplot() +
  geom_hex(aes(x = X, y = Y, fill = x, alpha = count),
           stat = 'identity') +
  facet_grid(outcome_name ~ predictor_name, scales = "free") +
  theme_classic() +
  scale_fill_viridis_c() +
  theme(legend.position = "bottom")
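If mapping alpha to the count is not essential, a possibly simpler alternative (a different technique from the manual binning above) is ggplot2's stat_summary_hex(), which summarises a z aesthetic within each hexagon. The sketch below uses hypothetical toy data standing in for a single facet of df; note that it gives the mean of factor as the fill only, not a count-based alpha:

```r
library(ggplot2)  # stat_summary_hex() also needs the hexbin package installed

set.seed(1)
# Hypothetical toy data standing in for one facet of df
toy <- data.frame(
  predictor_value = runif(2000),
  outcome_value   = runif(2000),
  factor          = sample(c(0.7, 0.8, 0.9, 1), 2000, replace = TRUE)
)

p <- ggplot(toy) +
  stat_summary_hex(
    aes(x = predictor_value, y = outcome_value, z = factor),
    fun = mean,  # summary applied to z within each hexagon
    bins = 10
  ) +
  scale_fill_viridis_c() +
  theme_classic() +
  theme(legend.position = "bottom")
```

With the full df, adding facet_grid(outcome_name ~ predictor_name, scales = "free") should bin each panel separately, as in the plots above.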