Search code examples
rbar-chartnormal-distribution

Adding a normal distribution to a bar chart in R


I want to plot a histogram and then overlay that with a normal distribution that represents the distribution of the data. However, my data is already counted:

df<- structure(list(trips = c(12955L, 36890L, 47035L, 48650L, 70910L, 
93755L, 45315L, 16565L, 4725L, 9460L), dist.km = c(0.5, 2, 4, 
8.5, 12.5, 17.5, 22.5, 27.5, 32.5, 42.5), share = c(0.03, 0.09, 
0.12, 0.13, 0.18, 0.24, 0.12, 0.04, 0.01, 0.02)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

Since the data is already counted, I can use barplot instead of hist:

barplot(df$share, 
          names.arg=census.car$dist.km,
          col="orange", 
          xlab="km", 
          ylab="trips")

enter image description here

Two questions:

  1. Is there a way to plot the histogram directly instead of using barplot in this case?
  2. How can I overlay this bar plot with a normal distribution line that fits my data?

Solution

  • Q1: if you do not have the original data, then you cannot use hist.

    Q2: with some work.

    First, barplot does not provide a discrete x-axis. Looking at your plot, this is clearly shown where the spacing between the first two columns (2-0.5 = 1.5) is the same as the last two columns (42.5-32.5 = 10). You can get the x-axis mid-points by looking at the (invisible) return value of barplot:

    (barplot(df$share, names.arg=df$dist.km,
             col="orange", xlab="km", ylab="trips"))
    #       [,1]
    #  [1,]  0.7
    #  [2,]  1.9
    #  [3,]  3.1
    #  [4,]  4.3
    #  [5,]  5.5
    #  [6,]  6.7
    #  [7,]  7.9
    #  [8,]  9.1
    #  [9,] 10.3
    # [10,] 11.5
    

    The points are equi-distant despite the actual points not doing that. This equi-distance is because R is effectively assuming categorical data, not continuous.

    To compensate for that, we can either adjust the width of the plots or the space between them. If we changed the width, then we would be conflating width with visual importance, something we should avoid, so let's go with "space":

    (bp <- barplot(df$share, names.arg=df$dist.km,
                   space = c(0, diff(df$dist.km)),
                   col="orange", xlab="km", ylab="trips"))
    #       [,1]
    #  [1,]  0.5
    #  [2,]  3.0
    #  [3,]  6.0
    #  [4,] 11.5
    #  [5,] 16.5
    #  [6,] 22.5
    #  [7,] 28.5
    #  [8,] 34.5
    #  [9,] 40.5
    # [10,] 51.5
    

    barplot adjusted for non-equi-distant spacing

    In order to plot a normal curve, we need the original distribution's mean and standard deviation. Without the original data, we can approximate it with weighted mean and weighted standard deviation, both provided by the Hmisc package.

    mu <- Hmisc::wtd.mean(df$dist.km, df$trips)
    sigma <- sqrt(Hmisc::wtd.var(df$dist.km, weights = df$trips))
    c(mu, sigma)
    # [1] 13.565338  8.911899
    

    Unfortunately, as we see in the output from the second barplot above, the x-axis is not on the same scale as the data. Luckily, it's still continuous and linear for us, so we just need to adjust for that. We can calculate it manually, but for the sake of argument, here's a back-transformation function:

    func <- function(a) {
      (min(df$dist.km) - bp[1,1]) + # the offset, happens to be 0 here since
                                    # the first datapoint is exactly 0.5
        a * diff(range(bp[,1])) / diff(range(df$dist.km))
    }
    
    mu2 <- func(mu)
    sigma2 <- sigma
    c(mu2, sigma2)
    # [1] 16.472196  8.911899
    

    Notice that we do not adjust the deviation: recall (from your statistics class) that when you add a value to all data in the source, the "location" stats (e.g., mean, median) are adjusted similarly (add the value) but the variance is unchanged.

    So we can now use curve to add that to the plot:

    curve(dnorm(x, mean=mu2, sd=sigma2),
          col = "red", lwd = 2, add=TRUE)
    

    barplot with un-scaled normal curve

    Note: The function call we gave as the first argument to curve needs the x variable there, even though we haven't defined it. This is used internally to curve and replaced with the actual vector of values. It can be different, perhaps with curve(dnorm(yy,...), xname="yy").

    Aesthetically it is not high enough ... we can scale it with the max frequency:

    # start over
    bp <- barplot(df$share, names.arg=df$dist.km,
                  space = c(0, diff(df$dist.km)),
                  col="orange", xlab="km", ylab="trips")
    curve(dnorm(x, mean=mu2, sd=sigma2) / max(df$share),
          col = "red", lwd = 2, add=TRUE)
    

    barplot with adjusted normal curve

    Last point: this normal curve is an approximation, and though it is good it is still imperfect. If you have the original data, it would be much better to use hist and actual mu/sigma values.