Tags: r, plot, ggplot2, runtime, loess

Quick way to add loess curve to large data set graph


I am trying to plot a vector y, which has 604800 points, against a sequence: x=seq(from=1, to=604800). The plotting itself is not a problem, but I also need to add a loess curve to the plot.

I have tried this using ggplot2, but it takes forever; ggplot2 is notoriously slow at plotting large datasets. See the R code:

vf <- ggplot(single.prop, aes(x,y)) + geom_line(linetype=1, size=1)
vf <- vf + stat_smooth(method="loess",fullrange=TRUE,aes(outfit=fit1<<-..y..))
vf

I have now tried base R instead, but this is also taking forever:

lw <- loess(y ~ x,data=single.prop)
plot(y ~ x, data=single.prop,pch=19,cex=0.1)
k <- order(single.prop$x)
lines(single.prop$x[k],lw$fitted[k],col="red",lwd=3)

Does anyone have suggestions for making this run faster? I have to do this multiple times, and so far I have been waiting about 15 minutes for a single plot, which still has not finished.


Solution

  • With this many data points the plot can indeed take a long time to render. It depends on the data, of course, but a plot with this many points often does not give a very interpretable picture anyway. For both speed and interpretability it can be useful to calculate summary statistics first and then plot those. In your situation, I can imagine that binning on x and calculating one or more statistics of y for every bin would be useful. Here is a small example using the mean, but you can use any statistic you like. Hope this helps.

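    library(ggplot2)

    # Example data: 10^6 points with a linear trend plus noise
    x <- 1:10^6
    y <- x/10^5 + rnorm(10^6)
    plot_dat <- data.frame(x, y)
    p <- ggplot(plot_dat, aes(x,y)) + geom_point()  # full plot, slow to render

    # Average y within consecutive bins of bin_size rows
    # (assumes nrow(plot_dat) is a multiple of bin_size)
    bin_plot_dat <- function(bin_size){
      nr_bins <- nrow(plot_dat) / bin_size
      x2 <- rep(1:nr_bins * bin_size, each = bin_size)
      y2 <- tapply(plot_dat$y, x2, mean)
      data.frame(x = unique(x2), y = as.vector(y2))
    }

    plot_dat2 <- bin_plot_dat(50)
    p2 <- ggplot(plot_dat2, aes(x,y)) +
      geom_point()

    # geom_smooth() chooses its method automatically; use method = "loess" to force loess
    p2 + geom_smooth()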
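
The same binning idea can be combined with the base R attempt from the question. What follows is only a rough sketch: the data are simulated stand-ins for single.prop, and the bin width of 600 is an arbitrary assumption that you would need to tune for your own data. The loess fit is computed on the bin means rather than on all 604800 points, so it approximates the full fit but does not reproduce it exactly.

```r
# Simulated stand-in for the question's single.prop (assumption, not real data)
set.seed(1)
n <- 604800
single.prop <- data.frame(x = seq_len(n), y = seq_len(n) / 1e5 + rnorm(n))

bin_size <- 600                              # assumed bin width; tune as needed
bin_id <- (single.prop$x - 1) %/% bin_size   # 0-based bin index for every row

# One row per bin: mean x (bin centre) and mean y
binned <- data.frame(
  x = as.vector(tapply(single.prop$x, bin_id, mean)),
  y = as.vector(tapply(single.prop$y, bin_id, mean))
)

# loess now fits on nrow(binned) rows instead of the full data set
lw <- loess(y ~ x, data = binned)

plot(y ~ x, data = binned, pch = 19, cex = 0.3)
lines(binned$x, predict(lw), col = "red", lwd = 3)
```

Because tapply returns the bins in increasing order of bin_id, binned$x is already sorted and no order() step is needed before calling lines().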