I have been struggling with how R calculates quantiles and the normal fitting of data. I have data (NDVI values) that follows a truncated normal distribution (see figure)
I am interested in getting the lowest 10th percentile value (p=0.1) from the data and from the fitting normal distribution curve.
In my understanding, because the data is truncated, the two should be quite different: I expect the quantile from the data to be higher than the one calculated from the normal distribution, but this is not so. For what I understand of the quantile function help the quantile from the data should be the default quantile function:
q=quantile(y, p=0.1)
while the quantile from the normal distribution is :
qx=quantile(y, p=0.1, type=9)
However the two result very close in all cases, which makes me wonder to what type of distribution does R fit the data to calculate the quantile (truncated normal dist.?)
I have also tried to calculate the quantile based on the fitting normal curve as:
fitted=fitdist(as.numeric(y), "norm", discrete = T)
fit.q=as.numeric(quantile(fitted, p=0.1)[[1]][1])
but obtaining no difference.
So my questions are: To what curve does R fit the data for calculating quantiles, in particular for type=9 ? How can I calculate the quantile based on the complete normal distribution (including the lower tail)?
I don't know how to generate a reproducible example for this, but the data is available at https://dl.dropboxusercontent.com/u/26249349/data.csv
Thanks!
R is using the empirical ordering of the data when determining quantiles, rather than assuming any particular distribution.
The 10th percentile for your truncated data and a normal distribution fit to your data happen to be pretty close, although the 1st percentile is quite a bit different. For example:
# Load data
df = read.csv("data.csv", header=TRUE, stringsAsFactors=FALSE)
# Fit a normal distribution to the data
df.dist = fitdist(df$x, "norm", discrete = T)
Now let's get quantiles of the fitted distribution and the original data. I've included the 1st percentile in addition to the 10th percentile. You can see that the fitted normal distribution's 10th percentile is just a bit lower than that of the data. However, the 1st percentile of the fitted normal distribution is much lower.
quantile(df.dist, p=c(0.01, 0.1))
Estimated quantiles for each specified probability (non-censored data) p=0.01 p=0.1 estimate 1632.829 2459.039
quantile(df$x, p=c(0.01, 0.1))
1% 10% 2064.79 2469.90
quantile(df$x, p=c(0.01, 0.1), type=9)
1% 10% 2064.177 2469.400
You can also see this by direct ranking of the data and by getting the 1st and 10th percentiles of a normal distribution with mean and sd equal to the fitted values from fitdist
:
# 1st and 10th percentiles of data by direct ranking
df$x[order(df$x)][round(c(0.01,0.1)*5780)]
[1] 2064 2469
# 1st and 10th percentiles of fitted distribution
qnorm(c(0.01,0.1), df.dist$estimate[1], df.dist$estimate[2])
[1] 1632.829 2459.039
Let's plot histograms of the original data (blue) and of fake data generated from the fitted normal distribution (red). The area of overlap is purple.
# Histogram of data (blue)
hist(df$x, xlim=c(0,8000), ylim=c(0,1600), col="#0000FF80")
# Overlay histogram of random draws from fitted normal distribution (red)
set.seed(685)
set.seed(685)
x.fit = rnorm(length(df$x), df.dist$estimate[1], df.dist$estimate[2])
hist(x.fit, add=TRUE, col="#FF000080")
Or we can plot the empirical cumulative distribution function (ecdf) for the data (blue) and the random draws from the fitted normal distribution (red). The horizontal grey line marks the 10th percentile:
plot(ecdf(df$x), xlim=c(0,8000), col="blue")
lines(ecdf(x.fit), col="red")
abline(0.1,0, col="grey40", lwd=2, lty="11")
Now that I've gone through this, I'm wondering if you were expecting fitdist
to return the parameters of the normal distribution we would have gotten had your data really come from a normal distribution and not been truncated. Rather, fitdist
returns a normal distribution with the mean and sd of the (truncated) data at hand, so the distribution returned by fitdist
is shifted to the right compared to where we might have "expected" it to be.
c(mean=mean(df$x), sd=sd(df$x))
mean sd 3472.4708 790.8538
df.dist$estimate
mean sd 3472.4708 790.7853
Or, another quick example: x
is normally distributed with mean ~ 0 and sd ~ 1. xtrunc
removes all values less than -1, and xtrunc.dist
is the output of fitdist
on xtrunc
:
set.seed(55)
x = rnorm(6000)
xtrunc = x[x > -1]
xtrunc.dist = fitdist(xtrunc, "norm")
round(cbind(sapply(list(x=x,xtrunc=xtrunc), function(x) c(mean=mean(x),sd=sd(x))),
xtrunc.dist=xtrunc.dist$estimate),3)
x xtrunc xtrunc.dist
mean -0.007 0.275 0.275
sd 1.009 0.806 0.806
And you can see in the ecdf plot below that the truncated data and the normal distribution fitted to the truncated data have about the same 10th percentile, while the 10th percentile of the untruncated data is (as we would expect) shifted to the left.