r ggplot2 statistics statsmodels

Overlaying data's density histogram with dlnorm in R, ggplot


I'm using Canada's census data, with the variable Wages on the x-axis and density on the y-axis. I'm trying to overlay the histogram I've created with the log-normal density dlnorm, but I'm not sure what to use as the meanlog and sdlog parameter values. I've tried mean(data$Wages) and sd(data$Wages), as well as the natural logarithm of both, etc. Nothing gives me a curve remotely similar to the density histogram I have generated.

Is this because my data is not log-normal? How can I find the correct meanlog and sdlog parameters?

This is my code:

inc_plot <- data_adults %>%
  ggplot(aes(x=Wages)) +
  geom_histogram(aes(y=..density..),  bins=100,fill="transparent", colour="black")+
  scale_x_continuous(labels=scales::comma) +
  stat_function(fun = dlnorm,
      args = list(meanlog = 48637.91, sdlog = 62459.15),
      col = "red")

inc_plot

The current parameter values come from the aforementioned mean() and sd() functions.



Solution

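  • If you set meanlog = mean(log(your_data)) and sdlog = sd(log(your_data)), the dlnorm curve should closely follow the histogram. Both parameters are defined on the log scale (they are the mean and standard deviation of log(x)), so the raw mean and sd of the data, or the logarithms of those two numbers, will not work.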

    library(ggplot2)
    
    
    df <- data.frame(x = rlnorm(1e4))
    
    ggplot(df, aes(x)) +
      geom_histogram(
        aes(y = after_stat(density)),
        bins = 100, fill = "transparent", colour = "black"
      ) +
      stat_function(
        fun = dlnorm,
        args = list(meanlog = mean(log(df$x)), sdlog = sd(log(df$x))),
        colour = "red"
      )
    

    Created on 2021-08-23 by the reprex package (v2.0.1)

    An alternative would be to use ggh4x::stat_theodensity(distri = "lnorm", colour = "red"). (disclaimer: I'm the author of ggh4x)
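
With real wage data there may be zero or missing wages, and log(0) is -Inf, which would break the estimates. The following is a minimal sketch under that assumption, not the original poster's data: `wages` is a simulated stand-in for the Wages column, and all variable names are hypothetical. It estimates the parameters from the positive values only, and also shows the method-of-moments conversion for the case where only the raw-scale mean and sd are available.

```r
# Simulated stand-in for the Wages column (hypothetical, not the census data):
# log-normal wages plus a few zeros, as real income data often contains.
set.seed(1)
wages <- c(rlnorm(1e4, meanlog = 10.3, sdlog = 1), rep(0, 50))

# log(0) is -Inf, so keep strictly positive, non-missing values only.
pos <- wages[!is.na(wages) & wages > 0]

# Log-scale estimates: these are what dlnorm expects as meanlog and sdlog.
meanlog_hat <- mean(log(pos))
sdlog_hat   <- sd(log(pos))

# Method of moments: convert the raw-scale mean (m) and sd (s) to the log scale.
# Usually less stable than the log-scale estimates for heavy-tailed data.
m <- mean(pos)
s <- sd(pos)
sdlog_mom   <- sqrt(log(1 + s^2 / m^2))
meanlog_mom <- log(m) - sdlog_mom^2 / 2
```

The resulting meanlog_hat and sdlog_hat can then be passed to stat_function(fun = dlnorm, args = list(meanlog = meanlog_hat, sdlog = sdlog_hat)). If values are filtered out for the estimate, the histogram should be drawn from the same filtered data so that both densities refer to the same sample. If the curve still does not match, the data may simply not be log-normal.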