Tags: r, ggplot2, histogram, ggplotly, geom-histogram

ggplot histogram grouped by years and density function


I want to create a histogram for my data table dt grouped by acquiYear, where the y-axis represents the nrOrders and the x-axis the month. My data table looks like this:

structure(list(acquiYear = c("2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2015", "2015", 
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015", 
"2015", "2015", "2016", "2016", "2016", "2016", "2016", "2016", 
"2016", "2016", "2016", "2016", "2016", "2016", "2017", "2017", 
"2017", "2017", "2017", "2017", "2017", "2017", "2017", "2017", 
"2017", "2017", "2018", "2018", "2018", "2018", "2018", "2018", 
"2018", "2018", "2018", "2018", "2018", "2018"), month = structure(c(1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("Jan", 
"Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", 
"Nov", "Dec"), class = "factor"), nrOrders = c(0, 0, 0, 0, 0, 
0, 0, 0, 1, 1, 2, 0, 2, 4, 5, 3, 7, 3, 5, 4, 3, 7, 8, 7, 2, 24, 
16, 33, 9, 27, 16, 10, 27, 9, 31, 35, 11, 11, 25, 15, 18, 19, 
19, 8, 27, 34, 43, 51, 0, 11, 2, 0, 0, 0, 0, 0, 4, 5, 1, 0)), 
row.names = c(NA, -60L), class = c("data.table", 
"data.frame"))

I need one bar for each month per acquiYear, and a density line over the months for each acquiYear. The colors for the years should be c("#00943C", "#4A52A0", "#FDC300", "#6F6F6F", "#EC4C24"). How can I do this?


Solution

  • The problem is that what you are describing is not a histogram. A histogram is a way to show the distribution of a single continuous variable. Typically, the range of this variable is shown along the x axis, and the axis is split into fixed-width bins. A bar is constructed for each bin where the height of the bar on the y axis shows the count or proportion of observations that lie within that bin.
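
    To make the distinction concrete, here is a minimal base R sketch of what a true histogram of this data would summarise: the distribution of the single variable nrOrders, ignoring month and year. The vector is copied from the question's dput output; the bin width of 10 is an arbitrary choice for illustration only.

    ```r
    # nrOrders copied from the question's data (60 monthly values)
    nrOrders <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0,
                  2, 4, 5, 3, 7, 3, 5, 4, 3, 7, 8, 7,
                  2, 24, 16, 33, 9, 27, 16, 10, 27, 9, 31, 35,
                  11, 11, 25, 15, 18, 19, 19, 8, 27, 34, 43, 51,
                  0, 11, 2, 0, 0, 0, 0, 0, 4, 5, 1, 0)

    # Split the range of the variable into fixed-width bins (width 10 is arbitrary)
    h <- hist(nrOrders, breaks = seq(0, 60, by = 10), plot = FALSE)

    # One count per bin; a histogram draws one bar per entry of h$counts
    bins <- data.frame(bin_start = head(h$breaks, -1), count = h$counts)
    print(bins)
    ```

    Neither month nor acquiYear appears anywhere in this calculation, which is why a histogram cannot show the number of orders per month and year.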

    What you have is observations of three variables: the month, the year and the number of orders. You wish to show the number of orders on the y axis as a function of month, and also display the year as a grouping variable. It therefore appears that you are looking for a dodged bar chart. Perhaps something like this:

    library(ggplot2)

    # dt is the data table from the question
    ggplot(dt, aes(month, nrOrders, fill = acquiYear)) +
      geom_col(position = 'dodge') +   # one bar per year within each month
      xlab("Month") +
      ylab("Nr. of Orders") +
      ggtitle("Delivery year 2018") +
      theme_classic() +
      theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
      theme(axis.title = element_text(face = "bold")) +
      scale_fill_manual(values = c("#00943C", "#4A52A0", "#FDC300",
                                   "#6F6F6F", "#EC4C24"))
    


    Similarly, adding a density curve for each year doesn't make any sense here. A density curve shows the density of measurements of a single variable over a continuous range (a bit like a smoothed histogram), whereas you have equally-spaced measurements that are already fully described by the bars.
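
    As a rough illustration of what a density curve actually estimates, the sketch below applies base R's density() to nrOrders alone. The vector is copied from the question's dput output; the default kernel and bandwidth are used only for illustration and are not a recommendation.

    ```r
    # nrOrders copied from the question's data (60 monthly values)
    nrOrders <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0,
                  2, 4, 5, 3, 7, 3, 5, 4, 3, 7, 8, 7,
                  2, 24, 16, 33, 9, 27, 16, 10, 27, 9, 31, 35,
                  11, 11, 25, 15, 18, 19, 19, 8, 27, 34, 43, 51,
                  0, 11, 2, 0, 0, 0, 0, 0, 4, 5, 1, 0)

    # Kernel density estimate of the single variable nrOrders
    d <- density(nrOrders)

    # The x axis of the estimate is the order count itself, not the month
    print(range(d$x))

    # The area under a density curve is approximately 1 (Riemann sum on the grid)
    area <- sum(d$y) * diff(d$x[1:2])
    print(area)
    ```

    The curve describes how monthly order counts are distributed, so it has no natural place on a plot whose x axis is the month.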

    You could add a smooth curve for each of the years, but the plot is already complex and the curves would not add any information; in fact, they would obscure the data that your plot already shows:

    # Requires the ggalt package for stat_xspline(); dt is the data table from the question
    # month is converted to numeric so the spline can be drawn on a continuous x axis
    ggplot(dt, aes(as.numeric(month), nrOrders, fill = acquiYear)) +
      geom_col(position = 'dodge') +
      ggalt::stat_xspline(geom = 'area', spline_shape = -0.4, alpha = 0.3) +
      xlab("Month") +
      ylab("Nr. of Orders") +
      ggtitle("Delivery year 2018") +
      theme_classic() +
      theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
      theme(axis.title = element_text(face = "bold")) +
      scale_fill_manual(values = c("#00943C", "#4A52A0", "#FDC300",
                                   "#6F6F6F", "#EC4C24")) +
      scale_x_continuous(breaks = 1:12, labels = month.abb)  # restore month labels
    


    If you really want to do this, you may find that faceting gives a clearer picture:

    # Requires the ggalt package for stat_xspline(); dt is the data table from the question
    ggplot(dt, aes(as.numeric(month), nrOrders, fill = acquiYear)) +
      geom_col(position = 'dodge', width = 0.5) +
      ggalt::stat_xspline(geom = 'area', spline_shape = -0.4, alpha = 0.5) +
      xlab("Month") +
      ylab("Nr. of Orders") +
      ggtitle("Delivery year 2018") +
      facet_wrap(.~acquiYear, ncol = 1) +   # one panel per year, stacked vertically
      theme_classic() +
      theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
      theme(axis.title = element_text(face = "bold")) +
      scale_fill_manual(values = c("#00943C", "#4A52A0", "#FDC300",
                                   "#6F6F6F", "#EC4C24")) +
      scale_x_continuous(breaks = 1:12, labels = month.abb)  # restore month labels
    
