Return CDF (as a function) given a vector of values and their corresponding probabilities

Consider the following simplified example:

x <- c(1, 2, 3)
p <- rep(1, times = 3)/3

This indicates that I have a discrete probability distribution with probability 1/3 assigned to each of 1, 2, and 3. You may assume that x and p are aligned as above (i.e., each component of x has its corresponding probability at the same position in p), with x in ascending order. You should not assume that x takes only integer values, and you should not assume that the components of p are all equal. In my actual problem, x and p are vectors of length approximately 100.

I would like to produce, in R, a function (not a graph, as in other examples I have seen) that returns the value of the cumulative distribution function of the probability mass function given above, using only x and p.

For the probability background: if you're not familiar with probability, the cumulative distribution function is the probability that you obtain a value less than or equal to a certain value. Let's call this "certain value" t.

If I give you any value t < 1, then based on the example above, the cumulative distribution function should output 0, since no probabilities are assigned to values less than 1.

Suppose I give you a value t satisfying t >= 1 and t < 2. In this interval, the only value at or below t that carries probability is 1, with probability 1/3. Hence for t >= 1 and t < 2, the cumulative distribution function should output 1/3.

If t >= 2 and t < 3, we accumulate the probability of 1/3 from the prior step plus the probability of 1/3 assigned to 2. Hence, if t >= 2 and t < 3, the cumulative distribution function should output 2/3.

If t >= 3, the cumulative distribution function should output 1.

We could theoretically code this function out as follows and have t be the only argument:

x_cdf <- function(t) {
  if (t < 1) {
    return(0)
  }
  if (t >= 1 && t < 2) {
    return(1/3)
  }
  if (t >= 2 && t < 3) {
    return(2/3)
  }
  if (t >= 3) {
    return(1)
  }
}

However, the difficulty here, from my perspective, is generating the if statements based on the vectors x and p.

To re-emphasize: the CDF should only depend on t as an argument, and should be readily generated from x and p. It is necessary that t be allowed to be a value that is NOT in the vector x.

Pseudocode of what I think I'm looking for:

generate_cdf <- function(x, p) {
  cdf <- function(t) {
    # some stuff here that depends on x and p that I'm not sure how to code
  }
  return(cdf)
}

Solution

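You are basically there.

The CDF at t is the sum of the probabilities attached to every value of x that is less than or equal to t. In R, logical indexing expresses that directly, so put this in the function body you are creating: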

sum( p[ x <= t ] )


generate_cdf <- function(x, p) {
    cdf <- function(t) {
        sum( p[ x <= t ] )
    }
    return(cdf)
}

f <- generate_cdf(x, p)

cbind( 0:4, sapply( 0:4, f ) )

Outputs:


> cbind( 0:4, sapply( 0:4, f ) )
     [,1]      [,2]
[1,]    0 0.0000000
[2,]    1 0.3333333
[3,]    2 0.6666667
[4,]    3 1.0000000
[5,]    4 1.0000000
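
As a quick check against the requirements in the question (non-integer values, unequal probabilities, and values of t that are not in x), here is a small sketch. The vectors x2 and p2 are made-up illustration data and do not come from the question.

```r
# Same closure-based generator as above.
generate_cdf <- function(x, p) {
  cdf <- function(t) {
    sum(p[x <= t])
  }
  cdf
}

# Hypothetical data: non-integer support, unequal probabilities.
x2 <- c(0.5, 1.25, 4)
p2 <- c(0.2, 0.5, 0.3)

g <- generate_cdf(x2, p2)

g(0)     # below every value of x2, so 0
g(1)     # only 0.5 is <= 1, so 0.2
g(1.25)  # 0.5 and 1.25 are both <= 1.25, so 0.2 + 0.5
g(10)    # every value of x2 is <= 10, so the probabilities sum to 1
```

Because the returned function keeps its own reference to x and p, g continues to work even if x2 and p2 are later modified or removed.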

Vectorize

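For added finesse, you can also wrap it in Vectorize(), so that it processes several values of t in one call. Without this, passing a vector t makes x <= t compare the two vectors elementwise (with recycling) and the function returns a single, meaningless sum: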


generate_cdf <- function(x, p) {
    cdf <- function(t) {
        sum( p[ x <= t ] )
    }
    return(Vectorize(cdf))
}

f <- generate_cdf(x, p)

f( c(1,2) ) # outputs [1] 0.3333333 0.6666667
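
If x is long or the CDF is evaluated at many points, one alternative worth considering is to precompute the cumulative sums once and look them up with findInterval(), which is vectorized by design. This is a sketch of a different technique from the one above, not a replacement for it. It relies on x being sorted in ascending order (as the question states), and the name generate_cdf_lookup is only an illustrative choice.

```r
generate_cdf_lookup <- function(x, p) {
  # Cumulative probabilities, with a leading 0 for t below min(x).
  steps <- c(0, cumsum(p))
  function(t) {
    # findInterval(t, x) counts how many entries of sorted x are <= t,
    # so adding 1 picks the matching cumulative probability.
    steps[findInterval(t, x) + 1]
  }
}

x <- c(1, 2, 3)
p <- rep(1, times = 3) / 3

f2 <- generate_cdf_lookup(x, p)

# Evaluate at several points at once, including values not in x.
f2(c(0, 1, 1.5, 2, 3, 4))
```

The results should agree with the Vectorize() version on the same inputs. The difference is that the work of accumulating p is done once when the function is created rather than on every call.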