Consider the following simplified example:
x <- c(1, 2, 3)
p <- rep(1, times = 3)/3
This indicates that I have a discrete probability distribution with probability 1/3 assigned to each of 1, 2, and 3. You may assume that x and p have been ordered appropriately as above (i.e., each component of x has corresponding probability in p), with x in ascending order as above. You should not assume that x only takes on integer values, and you should not assume that p is identical in every component. In my actual problem, x and p can be vectors with length of approximately 100.
I would like to produce a function (not a graph, as in other examples I have seen) that returns the value of the cumulative distribution function of the probability mass function given above in R, using only x and p.
For the probability background: if you're not familiar with probability, the cumulative distribution function is the probability that you obtain a value less than or equal to a certain value. Let's call this "certain value" t.
If I give you any value t < 1, then based on the example above, the cumulative distribution function should output 0, since no probabilities are assigned to values less than 1.
Suppose I give you a value t satisfying t >= 1 and t < 2. Then in this interval, one has that the probability assigned to 1 is 1/3, hence for t >= 1 and t < 2, the cumulative distribution function should output 1/3.
If t >= 2 and t < 3, up to this point, based on the prior discussion, we have a probability of 1/3 from the prior step, as well as a probability of 1/3 at t = 2. Hence, if t >= 2 and t < 3, the cumulative distribution function should output 2/3.
If t >= 3, the cumulative distribution function should output 1.
We could theoretically code this function out as follows and have t be the only argument:
x_cdf <- function(t) {
  # Hard-coded CDF for x = c(1, 2, 3) with probability 1/3 each; t is a single number
  if (t < 1) {
    return(0)
  }
  if (t >= 1 && t < 2) {
    return(1/3)
  }
  if (t >= 2 && t < 3) {
    return(2/3)
  }
  if (t >= 3) {
    return(1)
  }
}
However, the difficulty here, from my perspective, is generating the if statements based on the vectors x and p.
To re-emphasize: the CDF should only depend on t as an argument, and should be readily generated from x and p. It is necessary that t be allowed to be a value that is NOT in the vector x.
generate_cdf <- function(x, p) {
  cdf <- function(t) {
    # some stuff here that depends on x and p that I'm not sure how to code
  }
  return(cdf)
}
You are basically there.
Put this in the function body you are creating:
sum( p[ x <= t ] )
generate_cdf <- function(x, p) {
  cdf <- function(t) {
    # Add up the probabilities of all support points that are <= t
    sum( p[ x <= t ] )
  }
  return(cdf)
}
f <- generate_cdf(x, p)
cbind( 0:4, sapply( 0:4, f ) )
Outputs:
> cbind( 0:4, sapply( 0:4, f ) )
     [,1]      [,2]
[1,]    0 0.0000000
[2,]    1 0.3333333
[3,]    2 0.6666667
[4,]    3 1.0000000
[5,]    4 1.0000000
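Since the question stresses that x need not be integer-valued, that p need not be uniform, and that t may fall between support points, here is a small check of the same closure on made-up inputs. The vectors x2 and p2 and the name g are hypothetical, chosen only for illustration, and the generator is repeated so the block runs on its own.

```r
# Same closure-based generator as above, repeated for self-containment.
generate_cdf <- function(x, p) {
  function(t) sum(p[x <= t])
}

# Hypothetical inputs: non-integer support points, unequal probabilities.
x2 <- c(-0.5, 1.25, 4)
p2 <- c(0.25, 0.5, 0.25)
g <- generate_cdf(x2, p2)

g(-1)    # below the smallest support point: 0
g(0)     # between -0.5 and 1.25: 0.25
g(1.25)  # exactly at a support point: 0.75
g(10)    # above the largest support point: 1
```

The first case works because no element of x2 satisfies the condition, so p2 is subset to an empty vector and sum() of an empty vector is 0.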
For added finesse, you might Vectorize it also, letting it process multiple values in one go:
generate_cdf <- function(x, p) {
  cdf <- function(t) {
    sum( p[ x <= t ] )
  }
  # Vectorize wraps cdf so that it is applied to each element of t in turn
  return(Vectorize(cdf))
}
f <- generate_cdf(x, p)
f( c(1,2) ) # outputs [1] 0.3333333 0.6666667
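As a possible alternative to Vectorize, which still calls the inner function once per element, the cumulative sums can be computed once and looked up with findInterval from base R. This is only a sketch, relying on the question's guarantee that x is sorted in ascending order; the names generate_cdf_fi, cum_p and f2 are hypothetical.

```r
# Sketch of an alternative generator; assumes x is sorted in ascending order.
generate_cdf_fi <- function(x, p) {
  # Cumulative probabilities, with a leading 0 for values of t below min(x).
  cum_p <- c(0, cumsum(p))
  function(t) {
    # findInterval(t, x) gives, for each t, how many elements of x are <= t
    # (0 when t is below every element), so adding 1 indexes into cum_p.
    cum_p[findInterval(t, x) + 1]
  }
}

x <- c(1, 2, 3)
p <- rep(1, times = 3) / 3
f2 <- generate_cdf_fi(x, p)
f2(c(0, 1, 1.5, 2, 3, 4))
```

Up to floating-point rounding this should agree with the sum-based version for every t. The lookup is already vectorized over t, which may matter if the CDF is evaluated at many points.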