I want to plot a histogram and then overlay that with a normal distribution that represents the distribution of the data. However, my data is already counted:
df<- structure(list(trips = c(12955L, 36890L, 47035L, 48650L, 70910L,
93755L, 45315L, 16565L, 4725L, 9460L), dist.km = c(0.5, 2, 4,
8.5, 12.5, 17.5, 22.5, 27.5, 32.5, 42.5), share = c(0.03, 0.09,
0.12, 0.13, 0.18, 0.24, 0.12, 0.04, 0.01, 0.02)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Since the data is already counted, I can use barplot
instead of hist
:
barplot(df$share,
names.arg=census.car$dist.km,
col="orange",
xlab="km",
ylab="trips")
Two questions:
barplot
in this case?Q1: if you do not have the original data, then you cannot use hist
.
Q2: with some work.
First, barplot
does not provide a discrete x-axis. Looking at your plot, this is clearly shown where the spacing between the first two columns (2-0.5 = 1.5) is the same as the last two columns (42.5-32.5 = 10). You can get the x-axis mid-points by looking at the (invisible) return value of barplot
:
(barplot(df$share, names.arg=df$dist.km,
col="orange", xlab="km", ylab="trips"))
# [,1]
# [1,] 0.7
# [2,] 1.9
# [3,] 3.1
# [4,] 4.3
# [5,] 5.5
# [6,] 6.7
# [7,] 7.9
# [8,] 9.1
# [9,] 10.3
# [10,] 11.5
The points are equi-distant despite the actual points not doing that. This equi-distance is because R is effectively assuming categorical data, not continuous.
To compensate for that, we can either adjust the width of the plots or the space between them. If we changed the width, then we would be conflating width with visual importance, something we should avoid, so let's go with "space":
(bp <- barplot(df$share, names.arg=df$dist.km,
space = c(0, diff(df$dist.km)),
col="orange", xlab="km", ylab="trips"))
# [,1]
# [1,] 0.5
# [2,] 3.0
# [3,] 6.0
# [4,] 11.5
# [5,] 16.5
# [6,] 22.5
# [7,] 28.5
# [8,] 34.5
# [9,] 40.5
# [10,] 51.5
In order to plot a normal curve, we need the original distribution's mean and standard deviation. Without the original data, we can approximate it with weighted mean and weighted standard deviation, both provided by the Hmisc
package.
mu <- Hmisc::wtd.mean(df$dist.km, df$trips)
sigma <- sqrt(Hmisc::wtd.var(df$dist.km, weights = df$trips))
c(mu, sigma)
# [1] 13.565338 8.911899
Unfortunately, as we see in the output from the second barplot
above, the x-axis is not on the same scale as the data. Luckily, it's still continuous and linear for us, so we just need to adjust for that. We can calculate it manually, but for the sake of argument, here's a back-transformation function:
func <- function(a) {
(min(df$dist.km) - bp[1,1]) + # the offset, happens to be 0 here since
# the first datapoint is exactly 0.5
a * diff(range(bp[,1])) / diff(range(df$dist.km))
}
mu2 <- func(mu)
sigma2 <- sigma
c(mu2, sigma2)
# [1] 16.472196 8.911899
Notice that we do not adjust the deviation: recall (from your statistics class) that when you add a value to all data in the source, the "location" stats (e.g., mean, median) are adjusted similarly (add the value) but the variance is unchanged.
So we can now use curve
to add that to the plot:
curve(dnorm(x, mean=mu2, sd=sigma2),
col = "red", lwd = 2, add=TRUE)
Note: The function call we gave as the first argument to curve
needs the x
variable there, even though we haven't defined it. This is used internally to curve
and replaced with the actual vector of values. It can be different, perhaps with curve(dnorm(yy,...), xname="yy")
.
Aesthetically it is not high enough ... we can scale it with the max frequency:
# start over
bp <- barplot(df$share, names.arg=df$dist.km,
space = c(0, diff(df$dist.km)),
col="orange", xlab="km", ylab="trips")
curve(dnorm(x, mean=mu2, sd=sigma2) / max(df$share),
col = "red", lwd = 2, add=TRUE)
Last point: this normal curve is an approximation, and though it is good it is still imperfect. If you have the original data, it would be much better to use hist
and actual mu/sigma values.