I am using ecdf()
function to calculate empirical cumulative density function (ECDF) from some random samples:
set.seed(0)
X = rnorm(100)
P = ecdf(X)
Now P
gives ECDF and we could plot it:
plot(P)
abline(h = 0.6, lty = 3)
My question is: how can I find the sample value x
, such that P(x) = 0.6
, i.e., the 0.6-quantile of ECDF, or the x-coordinate of the intersection point of ECDF and h = 0.6
?
In the following, I will not use ecdf()
, as it is easy to obtain empirical cumulative density function (ECDF) ourselves.
First, we sort samples X
in ascending order:
X <- sort(X)
ECDF at those samples takes function values:
e_cdf <- 1:length(X) / length(X)
We could then sketch ECDF by:
plot(X, e_cdf, type = "s")
abline(h = 0.6, lty = 3)
Now, we are looking for the first value of X
, such that P(X) >= 0.6
. This is just:
X[which(e_cdf >= 0.6)[1]]
# [1] 0.2290196
Since our data are sampled from standard normal distribution, the theoretical quantile is
qnorm(0.6)
# [1] 0.2533471
So our result is quite close.
Since the inverse of CDF is quantile function (for example, the inverse of pnorm()
is qnorm()
), one may guess the inverse of ECDF as sample quantile, i,e, the inverse ecdf()
is quantile()
. This is not true!
ECDF is a staircase / step function, and it does not have inverse. If we rotate ECDF around y = x
, the resulting curve is not a mathematical function. So sample quantile is has nothing to do with ECDF.
For n
sorted samples, sample quantile function is in fact a linear interpolation function of (x, y)
, with:
seq(0, 1, length = n)
;We can define our own version of sample quantile function by:
my_quantile <- function(x, prob) {
if (is.unsorted(x)) x <- sort(x)
n <- length(x)
approx(seq(0, 1, length = n), x, prob)$y
}
Let's have a test:
my_quantile(X, 0.6)
# [1] 0.2343171
quantile(X, prob = 0.6, names = FALSE)
# [1] 0.2343171
Note that result is different from what we get from X[which(e_cdf >= 0.6)[1]]
.
It is for this reason, that I refuse to use quantile()
in my answer.