
Fix continuous color scale for quick visualisation in ggplot2


I need to plot a huge dataset (1 million observations) against two variables. I want:

  • An equivalent of geom_point to see the distribution of my data
  • A geom_smooth to see the global trend

My data is very concentrated in some areas of my graphs. However, since I have a lot of data, the geom_smooth should be valid across most areas of my graph (but not all).

I can use geom_point() for that, but it takes a long time to plot and can lead to misinterpretation, since the graph needs to be zoomed in to see the real position of the points.

set.seed(1)
library(data.table)
library(ggplot2)

d <- data.table(
  a = c(sample(seq(1, 1500, 1), 20000, replace = TRUE), sample(seq(1998, 2000, 1), 1000, replace = TRUE), sample(seq(1, 150, 1), 19000, replace = TRUE)),
  b = c(sample(seq(1, 2000, 1), 20000, replace = TRUE), sample(seq(150, 160, 1), 1000, replace = TRUE), sample(seq(1100, 1600, 1), 19000, replace = TRUE))
)

ggplot(d) + aes(x = a, y = b) +
  geom_point(shape = 1, alpha = 0.2) +
  geom_smooth(col = "black")

Here we have a zoomed-out plot: it gives the impression that the density in the left part of the graph is rather homogeneous.

small plot

In reality, the density varies inside this area:

Zoomed plot

To address this, and to address the slow rendering of geom_point(), I found the geom_hex() function.

ggplot(d) + aes(x=a,y=b)+
  geom_hex(bins=70,col="white") +
  geom_smooth(col="black")

First geom hex

Here, we can see that the hexagons to the right are very dense in terms of data, but we barely see that the left part is also denser than the rest of the graph.

To address this problem, I set another scale_fill_gradient(), as suggested here. I set the gradient's limits to 0-150, considering that a hexagon with more than 150 observations should be treated as dense.

ggplot(d) + aes(x=a,y=b)+
  geom_hex(bins=70,col="white") + 
  scale_fill_gradient(low="yellow", high="coral2",limits=c(0,150)) +
  geom_smooth(col="black")

The problem is that the hexagons that exceed 150 observations are blue, but I need them red so the graph can be interpreted. I still want some nuance in my graph and want to keep a gradient for the hexagons with fewer than 150 observations (I don't want just two colors).

second geom hex with fixed colors

Can someone help me with that?

PS: I used ggthemr::ggthemr("pale") to get prettier graphs, so it is normal if the formatting isn't the same for you.

PPS: this is dummy data. Obviously my real data isn't as boring and homogeneous as this (and the distribution of points is more complicated); I just did what I could to recreate the problem.


Solution

  • If you want a bit more differentiation between lower values on the scale, you can use scale_fill_gradientn and play around with the colours and values arguments to get a result that works well with your data:

    ggplot(d) + 
      aes(x = a, y = b) +
      geom_hex(bins = 70, colour = "white") + 
      scale_fill_gradientn(colours = c("white", 'yellow', 
                                       'gold', 'coral2', 'red2'), 
                           values = c(0, 0.01, 0.1, 0.9, 1)) +
      geom_smooth(colour = "black") +
      theme_minimal()
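
The values argument above takes positions on a 0-1 scale rather than counts. If you would rather think in counts, a minimal sketch (assuming the scales package, which ggplot2 depends on, and using hypothetical breakpoints of 0, 10, 50, 100 and 150 observations that you would tune to your data) is to rescale the counts yourself and fix the limits. Hexagons above the cap then fall outside the limits and take na.value, which is set to the top colour here:

```r
library(data.table)
library(ggplot2)

# Same dummy data as in the question
set.seed(1)
d <- data.table(
  a = c(sample(seq(1, 1500, 1), 20000, replace = TRUE),
        sample(seq(1998, 2000, 1), 1000, replace = TRUE),
        sample(seq(1, 150, 1), 19000, replace = TRUE)),
  b = c(sample(seq(1, 2000, 1), 20000, replace = TRUE),
        sample(seq(150, 160, 1), 1000, replace = TRUE),
        sample(seq(1100, 1600, 1), 19000, replace = TRUE))
)

# Hypothetical choices: the cap and the counts at which each colour is reached
cap <- 150
count_stops <- c(0, 10, 50, 100, 150)

# Convert counts to positions in [0, 1] relative to the fixed limits
stops <- scales::rescale(count_stops, to = c(0, 1), from = c(0, cap))

p_stops <- ggplot(d, aes(x = a, y = b)) +
  geom_hex(bins = 70, colour = "white") +
  scale_fill_gradientn(
    colours = c("white", "yellow", "gold", "coral2", "red2"),
    values = stops,
    limits = c(0, cap),
    na.value = "red2"  # hexagons with more than `cap` points are out of limits
  ) +
  geom_smooth(colour = "black") +
  theme_minimal()
```

Call print(p_stops) to draw the plot. geom_hex() needs the hexbin package to be installed.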
    

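
For the two-colour gradient from the question, another option is worth trying. By default, a continuous ggplot2 scale censors values outside its limits (they become NA and are drawn in the scale's na.value colour). Passing oob = scales::squish clamps them to the nearest limit instead, so every hexagon above 150 gets the high colour. A sketch, reusing the question's colours and limits:

```r
library(data.table)
library(ggplot2)

# Same dummy data as in the question
set.seed(1)
d <- data.table(
  a = c(sample(seq(1, 1500, 1), 20000, replace = TRUE),
        sample(seq(1998, 2000, 1), 1000, replace = TRUE),
        sample(seq(1, 150, 1), 19000, replace = TRUE)),
  b = c(sample(seq(1, 2000, 1), 20000, replace = TRUE),
        sample(seq(150, 160, 1), 1000, replace = TRUE),
        sample(seq(1100, 1600, 1), 19000, replace = TRUE))
)

# What the two out-of-bounds handlers do to raw values
censored <- scales::censor(c(10, 150, 400), range = c(0, 150))  # 400 becomes NA
squished <- scales::squish(c(10, 150, 400), range = c(0, 150))  # 400 becomes 150

p_squish <- ggplot(d, aes(x = a, y = b)) +
  geom_hex(bins = 70, colour = "white") +
  scale_fill_gradient(
    low = "yellow", high = "coral2",
    limits = c(0, 150),
    oob = scales::squish  # counts above 150 are drawn with the `high` colour
  ) +
  geom_smooth(colour = "black")
```

Call print(p_squish) to draw the plot. The legend still tops out at 150, so it may help to relabel the top break (for example as "150+") to make clear that the last colour covers everything above the cap.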