Adding a normal distribution to a bar chart in R

I want to plot a histogram and then overlay that with a normal distribution that represents the distribution of the data. However, my data is already counted:

df<- structure(list(trips = c(12955L, 36890L, 47035L, 48650L, 70910L, 
93755L, 45315L, 16565L, 4725L, 9460L), dist.km = c(0.5, 2, 4, 
8.5, 12.5, 17.5, 22.5, 27.5, 32.5, 42.5), share = c(0.03, 0.09, 
0.12, 0.13, 0.18, 0.24, 0.12, 0.04, 0.01, 0.02)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

Since the data is already counted, I can use barplot instead of hist:

barplot(df$share, 
          names.arg=census.car$dist.km,
          col="orange", 
          xlab="km", 
          ylab="trips")

Two questions:

Is there a way to plot the histogram directly instead of using barplot in this case?
How can I overlay this bar plot with a normal distribution line that fits my data?

Solution

Q1: if you do not have the original data, then you cannot use hist.

Q2: with some work.

First, barplot does not provide a discrete x-axis. Looking at your plot, this is clearly shown where the spacing between the first two columns (2-0.5 = 1.5) is the same as the last two columns (42.5-32.5 = 10). You can get the x-axis mid-points by looking at the (invisible) return value of barplot:

(barplot(df$share, names.arg=df$dist.km,
         col="orange", xlab="km", ylab="trips"))
#       [,1]
#  [1,]  0.7
#  [2,]  1.9
#  [3,]  3.1
#  [4,]  4.3
#  [5,]  5.5
#  [6,]  6.7
#  [7,]  7.9
#  [8,]  9.1
#  [9,] 10.3
# [10,] 11.5

The points are equi-distant despite the actual points not doing that. This equi-distance is because R is effectively assuming categorical data, not continuous.

To compensate for that, we can either adjust the width of the plots or the space between them. If we changed the width, then we would be conflating width with visual importance, something we should avoid, so let's go with "space":

(bp <- barplot(df$share, names.arg=df$dist.km,
               space = c(0, diff(df$dist.km)),
               col="orange", xlab="km", ylab="trips"))
#       [,1]
#  [1,]  0.5
#  [2,]  3.0
#  [3,]  6.0
#  [4,] 11.5
#  [5,] 16.5
#  [6,] 22.5
#  [7,] 28.5
#  [8,] 34.5
#  [9,] 40.5
# [10,] 51.5

In order to plot a normal curve, we need the original distribution's mean and standard deviation. Without the original data, we can approximate it with weighted mean and weighted standard deviation, both provided by the Hmisc package.

mu <- Hmisc::wtd.mean(df$dist.km, df$trips)
sigma <- sqrt(Hmisc::wtd.var(df$dist.km, weights = df$trips))
c(mu, sigma)
# [1] 13.565338  8.911899

Unfortunately, as we see in the output from the second barplot above, the x-axis is not on the same scale as the data. Luckily, it's still continuous and linear for us, so we just need to adjust for that. We can calculate it manually, but for the sake of argument, here's a back-transformation function:

func <- function(a) {
  (min(df$dist.km) - bp[1,1]) + # the offset, happens to be 0 here since
                                # the first datapoint is exactly 0.5
    a * diff(range(bp[,1])) / diff(range(df$dist.km))
}

mu2 <- func(mu)
sigma2 <- sigma
c(mu2, sigma2)
# [1] 16.472196  8.911899

Notice that we do not adjust the deviation: recall (from your statistics class) that when you add a value to all data in the source, the "location" stats (e.g., mean, median) are adjusted similarly (add the value) but the variance is unchanged.

So we can now use curve to add that to the plot:

curve(dnorm(x, mean=mu2, sd=sigma2),
      col = "red", lwd = 2, add=TRUE)

Note: The function call we gave as the first argument to curve needs the x variable there, even though we haven't defined it. This is used internally to curve and replaced with the actual vector of values. It can be different, perhaps with curve(dnorm(yy,...), xname="yy").

Aesthetically it is not high enough ... we can scale it with the max frequency:

# start over
bp <- barplot(df$share, names.arg=df$dist.km,
              space = c(0, diff(df$dist.km)),
              col="orange", xlab="km", ylab="trips")
curve(dnorm(x, mean=mu2, sd=sigma2) / max(df$share),
      col = "red", lwd = 2, add=TRUE)

Last point: this normal curve is an approximation, and though it is good it is still imperfect. If you have the original data, it would be much better to use hist and actual mu/sigma values.