I'm using Canadian census data and plotting Wages on the x-axis against density on the y-axis. I'm trying to overlay a log-normal density (dlnorm) on the histogram I've created, but I'm not sure what to use for the meanlog and sdlog parameter values. I've tried mean(data$Wages) and sd(data$Wages), as well as the natural logarithm of each, and so on. Nothing gives me a curve remotely similar to the density histogram I have generated.
Is this because my data is not log-normal? How can I find the correct meanlog and sdlog parameters?
This is my code:
inc_plot <- data_adults %>%
  ggplot(aes(x = Wages)) +
  geom_histogram(aes(y = ..density..), bins = 100, fill = "transparent", colour = "black") +
  scale_x_continuous(labels = scales::comma) +
  stat_function(fun = dlnorm,
                args = list(meanlog = 48637.91, sdlog = 62459.15),
                col = "red")
inc_plot
The current parameter values come from the mean() and sd() functions mentioned above.
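The meanlog and sdlog arguments of dlnorm are the mean and standard deviation of the logarithm of the variable, not of the variable itself. If you set meanlog = mean(log(your_data)) and likewise sdlog = sd(log(your_data)), the density should approximate the histogram.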
library(ggplot2)
df <- data.frame(x = rlnorm(1e4))
ggplot(df, aes(x)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 100, fill = "transparent", colour = "black"
  ) +
  stat_function(
    fun = dlnorm,
    args = list(meanlog = mean(log(df$x)), sdlog = sd(log(df$x))),
    colour = "red"
  )
Created on 2021-08-23 by the reprex package (v2.0.1)
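Applied to a data frame shaped like the one in the question, the same idea might look like the sketch below. The simulated data_adults is only a stand-in for the census data, and the filter assumes that Wages may contain zeros or missing values, which have to be dropped because log(0) is -Inf and would break both estimates.

```r
library(ggplot2)

# Stand-in for the census data (an assumption; replace with the real data_adults)
set.seed(1)
data_adults <- data.frame(
  Wages = c(rlnorm(5000, meanlog = 10.5, sdlog = 0.8), rep(0, 50))
)

# Keep strictly positive, non-missing wages: log() is not finite for values <= 0
wages_pos <- subset(data_adults, !is.na(Wages) & Wages > 0)

# The log-normal parameters are the mean and sd of the *logged* data
ml <- mean(log(wages_pos$Wages))
sl <- sd(log(wages_pos$Wages))

inc_plot <- ggplot(wages_pos, aes(x = Wages)) +
  geom_histogram(aes(y = after_stat(density)), bins = 100,
                 fill = "transparent", colour = "black") +
  scale_x_continuous(labels = scales::comma) +
  stat_function(fun = dlnorm,
                args = list(meanlog = ml, sdlog = sl),
                colour = "red")
inc_plot
```

If the fitted curve still misses the histogram badly after this, that would be a hint that the wages are not well described by a log-normal distribution.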
An alternative would be to use ggh4x::stat_theodensity(distri = "lnorm", colour = "red"). (Disclaimer: I'm the author of ggh4x.)
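As a side note on why mean(data$Wages), sd(data$Wages), or their logarithms do not work as parameters: for a log-normal variable the raw mean is exp(meanlog + sdlog^2 / 2), so log(mean(x)) overshoots meanlog. The sketch below, on simulated data with made-up parameter values, compares the standard method-of-moments conversion from the raw mean and sd with the direct estimates from the logged data.

```r
# Simulated log-normal data with known parameters (illustrative values only)
set.seed(42)
x <- rlnorm(1e5, meanlog = 2, sdlog = 0.5)

m <- mean(x)
s <- sd(x)

# Method-of-moments conversion from the raw mean/sd to log-scale parameters
sdlog_mom   <- sqrt(log(1 + s^2 / m^2))
meanlog_mom <- log(m) - sdlog_mom^2 / 2

# Direct estimates from the logged data, as suggested above
meanlog_hat <- mean(log(x))
sdlog_hat   <- sd(log(x))

print(c(meanlog_mom = meanlog_mom, meanlog_hat = meanlog_hat))
print(c(sdlog_mom = sdlog_mom, sdlog_hat = sdlog_hat))
```

Both routes should land close to the simulated parameters here. On real data the direct estimates from log(x) are usually the safer choice, since the raw mean and sd are sensitive to extreme values.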