Search code examples
rhistogram

How can one add a cumulative trend line based on weight to a histogram in R?


Having some serious trouble adding a cumulative trend line to my histogram below. The key difference from all the examples I can find is that the x-axis should be bins of distance, but the y-axis the sum of tonne.km per bin. I am using weight for this. The cumulative density curve should include the weight = tonne.km.

Some randomly generated data for this.

distance<-rnorm(1000000, mean = 1000, sd = 500)
tonne.km<-rnorm(1000000, mean = 25000, sd = 500)
dist.tk.test <- structure(tibble(distance, tonne.km))

My code:

dist.tk.test %>% 
  ggplot() +
  geom_histogram(aes(x = distance, y=..density.., weight = tonne.km), bins = 50) +
  stat_bin(aes(x = distance, y=cumsum(..density..)),geom="line",color="red") + 
  scale_x_continuous(label = comma,
                     breaks = extended_breaks(10)) +
  scale_y_continuous(labels=function(x)x*1,
                     sec.axis = sec_axis(~ ., labels = scales::percent, name = "Cumulative Share (%)"))

This is the visual outcome:

enter image description here

I would like the line to follow the second y-axis, and the first y-axis to have the sum of tonne.km per bin instead of the current density.

Is this possible using weight=tonne.km? Or do I need to use a completely different graph?

Thanks in advance!


Solution

  • geom_histogram does not have a weight aesthetic so I do not understand how do you want to do with tonne.km. But if you want to superimpose the CDF to the histogram, here is a way.

    First realize that a density such as the empirical histogram density and a ECDF are many times on different scales, specially if the distribution is continuous and the sample is large. Then, the main trick is to scale the ECDF by the maximum density y value.

    library(ggplot2)
    library(scales)
    
    distance <- rnorm(1000000, mean = 1000, sd = 500)
    tonne.km <- rnorm(1000000, mean = 25000, sd = 500)
    dist.tk.test <- data.frame(distance, tonne.km)
    
    bins <- 50L
    x_breaks <- 10L
    
    max_y <- max(density(dist.tk.test$distance)$y)
    
    ggplot(dist.tk.test) +
      geom_histogram(
        aes(x = distance, y = ..density..), bins = bins
      ) +
      geom_line(
        aes(
          x = sort(distance),
          y = max_y * seq_along(distance)/length(distance)
        ),
        color = "red"
      ) +
      scale_x_continuous(label = comma,
                         breaks = extended_breaks(x_breaks)) +
      scale_y_continuous(
        name = "Density",
        sec.axis = sec_axis(~ .x / max_y , 
                            labels = scales::percent, 
                            name = "Cumulative Share (%)")
      )
    

    Created on 2022-08-17 by the reprex package (v2.0.1)


    Edit

    Following the comment below, here is another solution.
    The total tonne.km by bins of distance is computed first.
    In order to do this, the distances must be binned. I use findInterval to bin them and then sum the tonne.km per bin (variable breaks) with aggregate. This is the data.frame used in the plot.

    library(ggplot2)
    library(scales)
    
    set.seed(2022)
    distance <- rnorm(1000000, mean = 1000, sd = 500)
    tonne.km <- rnorm(1000000, mean = 25000, sd = 500)
    dist.tk.test <- data.frame(distance, tonne.km)
    
    breaks <- range(dist.tk.test$distance)
    breaks <- round(breaks/100)*100
    breaks <- seq(breaks[1], breaks[2], by = 50)
    bins <- findInterval(dist.tk.test$distance, breaks)
    breaks <- breaks[bins]
    
    new_df <- aggregate(tonne.km ~ breaks, dist.tk.test, sum, na.rm = TRUE)
    y_max <- max(new_df$tonne.km, na.rm = TRUE)
    
    x_axis_breaks <- 10L
    
    ggplot(new_df, aes(breaks, tonne.km)) +
      geom_col(position = position_dodge(), width = 100) +
      geom_line(
        aes(
          y = y_max * cumsum(tonne.km)/sum(tonne.km)
        ),
        color = "red"
      ) +
      scale_x_continuous(
        name = "Distance",
        label = comma,
        breaks = extended_breaks(x_axis_breaks)) +
      scale_y_continuous(
        name = "Tonne/Km",
        sec.axis = sec_axis(~ .x/y_max, 
                            labels = scales::percent, 
                            name = "Cumulative Share (%)")
      )
    #> Warning: position_dodge requires non-overlapping x intervals
    

    Created on 2022-08-17 by the reprex package (v2.0.1)