
Return CDF (as a function) given a vector of values and their corresponding probabilities


Consider the following simplified example:

x <- c(1, 2, 3)
p <- rep(1, times = 3)/3

This describes a discrete probability distribution with probability 1/3 assigned to each of 1, 2, and 3. You may assume that x is in ascending order and that each component of p is the probability of the corresponding component of x. You should not assume that x takes only integer values, and you should not assume that all components of p are equal. In my actual problem, x and p can be vectors of length approximately 100.

I would like to produce a function (not a graph, as in other examples I have seen) that evaluates the cumulative distribution function of the probability mass function above in R, using only x and p.


Some probability background, in case it is unfamiliar: the cumulative distribution function, evaluated at a value t, is the probability of obtaining a value less than or equal to t.

If I give you any value t < 1, then based on the example above, the cumulative distribution function should output 0, since no probabilities are assigned to values less than 1.

Suppose I give you a value t satisfying t >= 1 and t < 2. The only value of x at or below t is 1, which has probability 1/3, so for t >= 1 and t < 2 the cumulative distribution function should output 1/3.

If t >= 2 and t < 3, the values of x at or below t are 1 and 2, each with probability 1/3. Hence, if t >= 2 and t < 3, the cumulative distribution function should output 1/3 + 1/3 = 2/3.

If t >= 3, the cumulative distribution function should output 1.
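The four cases above follow one pattern: the value on each step is the running total of p up to that point. A minimal sketch, using only the x and p from the example and base R's cumsum, shows those running totals (the name steps is made up for illustration):

```r
x <- c(1, 2, 3)
p <- rep(1, times = 3) / 3

# Running totals of p: the CDF value reached at each point of x
steps <- cumsum(p)

# Pair each jump location with the CDF value reached there
data.frame(x = x, cdf = steps)
```

Below the smallest element of x the value is 0, and between two consecutive elements the function stays at the running total of the lower one.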


We could theoretically code this function out as follows and have t be the only argument:

x_cdf <- function(t) {
  if (t < 1) {
    return(0)
  }
  if (t >= 1 && t < 2) {
    return(1/3)
  }
  if (t >= 2 && t < 3) {
    return(2/3)
  }
  if (t >= 3) {
    return(1)
  }
}

However, the difficulty here, from my perspective, is generating the if statements based on the vectors x and p.

To re-emphasize: the CDF should only depend on t as an argument, and should be readily generated from x and p. It is necessary that t be allowed to be a value that is NOT in the vector x.


Pseudocode of what I think I'm looking for:

generate_cdf <- function(x, p) {
  cdf <- function(t) {
    # some stuff here that depends on x and p that I'm not sure how to code
  }
  return(cdf)
}

Solution

  • You are basically there.

    Put this in the function body you are creating:

    sum( p[ x <= t ] )
    
    
    generate_cdf <- function(x, p) {
        cdf <- function(t) {
            sum( p[ x <= t ] )
        }
        return(cdf)
    }
    
    f <- generate_cdf(x, p)
    
    cbind( 0:4, sapply( 0:4, f ) )
    
    

    Outputs:

    
    > cbind( 0:4, sapply( 0:4, f ) )
         [,1]      [,2]
    [1,]    0 0.0000000
    [2,]    1 0.3333333
    [3,]    2 0.6666667
    [4,]    3 1.0000000
    [5,]    4 1.0000000
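
The same approach should also cover the other requirements in the question: non-integer values in x, unequal probabilities, and values of t that are not in x. A quick check, using hypothetical values x2 and p2 that are not from the question:

```r
# Same generator as in the answer above
generate_cdf <- function(x, p) {
  function(t) sum(p[x <= t])
}

# Hypothetical values, for illustration only
x2 <- c(0.5, 1.7, 4.2)
p2 <- c(0.2, 0.5, 0.3)
f2 <- generate_cdf(x2, p2)

f2(0)     # below every element of x2, so nothing is summed: 0
f2(1)     # only 0.5 is <= 1: 0.2
f2(1.7)   # includes the jump at 1.7: 0.2 + 0.5
f2(10)    # above every element of x2: all of p2 is summed
```

This works because x <= t is evaluated afresh on every call, and sum() of an empty selection is 0.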
    
    

    Vectorize

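    The function above handles one value of t per call. If it is given a vector, x <= t compares element by element (recycling the shorter vector) and the function returns a single, incorrect sum. Wrapping it in Vectorize lets it process multiple values in one go: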

    
    generate_cdf <- function(x, p) {
        cdf <- function(t) {
            sum( p[ x <= t ] )
        }
        return(Vectorize(cdf))
    }
    
    f <- generate_cdf(x, p)
    
    f( c(1,2) ) # outputs [1] 0.3333333 0.6666667
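
Vectorize is built on mapply, so each value of t still triggers a full scan of x. For vectors of length about 100 that is unlikely to matter, but as one possible alternative (not part of the original answer), the running totals can be computed once and then looked up with base R's findInterval, which is itself vectorized over t. This sketch assumes x is sorted in ascending order, as the question states; the names generate_cdf_fast and heights are made up for illustration.

```r
generate_cdf_fast <- function(x, p) {
  # Assumes x is sorted ascending and p[i] is the probability of x[i]
  heights <- c(0, cumsum(p))  # CDF value after 0, 1, ..., length(x) jumps
  function(t) {
    # findInterval(t, x) counts how many elements of x are <= t
    heights[findInterval(t, x) + 1]
  }
}

x <- c(1, 2, 3)
p <- rep(1, times = 3) / 3
g <- generate_cdf_fast(x, p)

# Evaluate at points below, on, between, and above the elements of x
g(c(0, 1, 1.5, 2, 3, 4))
```

Base R's stepfun(x, c(0, cumsum(p))) should give an equivalent right-continuous step function, if a ready-made object with plot and print methods is preferred.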