r, dataframe, quantile, ecdf

Calculate a percentile of dataframe column efficiently


This question is an extension of the Stack Overflow question asked and answered here.

My circumstances are different in that I want to calculate the percentile of each value within a vector of 50,000 (or more!) values. For example:

df <- data.frame(val = rnorm(n = 50000, mean = 50, sd = 20))
df$val.percentile <- sapply(X = df$val, function(x) ecdf(df$val)(x))
head(df)

Is there a good way to speed up the percentile calculation for each value? I'd like the run time to be as short as possible.


Solution

  • You can use dplyr::percent_rank() to rank each value as a percentile. Note that this differs from dplyr::cume_dist(), which ranks each value by the empirical cumulative distribution function (the proportion of all values less than or equal to the current value).

    Reproducible example:

    set.seed(1)
    df <- data.frame(val = rnorm(n = 1000000, mean = 50, sd = 20))
    

    Show that percent_rank() differs from cume_dist() and that cume_dist() is the same as ecdf(x)(x):

    library(tidyverse)
    
    head(df) %>% 
      mutate(pr  = percent_rank(val), 
             cd  = ecdf(val)(val), 
             cd2 = cume_dist(val))
    
           val  pr        cd       cd2
    1 37.47092 0.4 0.5000000 0.5000000
    2 53.67287 0.6 0.6666667 0.6666667
    3 33.28743 0.0 0.1666667 0.1666667
    4 81.90562 1.0 1.0000000 1.0000000
    5 56.59016 0.8 0.8333333 0.8333333
    6 33.59063 0.2 0.3333333 0.3333333
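
    To see where the two columns come from, both functions can be reproduced with base R's rank(). This is a minimal sketch, assuming there are no missing values; the formulas follow the definitions given in the dplyr documentation, and the six values are copied from the output above:

    ```r
    # Six values copied from the head(df) output above.
    val <- c(37.47092, 53.67287, 33.28743, 81.90562, 56.59016, 33.59063)
    n <- length(val)

    # percent_rank(): minimum rank rescaled to the interval [0, 1].
    pr <- (rank(val, ties.method = "min") - 1) / (n - 1)

    # cume_dist(): proportion of values less than or equal to each value.
    cd <- rank(val, ties.method = "max") / n

    round(data.frame(val, pr, cd), 7)
    ```

    The smallest value gets pr = 0 but cd = 1/n, which is why the two columns never agree except at the maximum.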
    

    On this example dataset the approaches run at roughly similar speeds, differing by no more than a factor of 2:

    library(microbenchmark)
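    mbm <- microbenchmark(
        # percentile rank via dplyr
        pr_dplyr = mutate(df, pr = percent_rank(val)),
        # cumulative distribution via dplyr
        cd_dplyr = mutate(df, cd = cume_dist(val)),
        # cumulative distribution via base R's ecdf()
        cd_base  = mutate(df, cd = ecdf(val)(val)),
        times = 20
    )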
    
    autoplot(mbm)
    

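
    If you would rather not depend on dplyr, the main cost in the question's code is that ecdf() is rebuilt for every element inside sapply(). A sketch of a base R alternative, assuming the vector has no missing values (the names Fn, p_ecdf, p_rank and p_slow are only illustrative), builds the ECDF once and evaluates it on the whole vector:

    ```r
    set.seed(1)
    val <- rnorm(n = 50000, mean = 50, sd = 20)

    # Build the ECDF once, then evaluate it on the whole vector in one call.
    Fn <- ecdf(val)
    p_ecdf <- Fn(val)

    # Rank-based equivalent: proportion of values <= each value.
    p_rank <- rank(val, ties.method = "max") / length(val)

    # Spot-check against the element-wise approach from the question,
    # restricted to the first 100 values to keep it quick.
    idx <- 1:100
    p_slow <- sapply(val[idx], function(x) ecdf(val)(x))

    all.equal(p_ecdf, p_rank)
    all.equal(p_ecdf[idx], p_slow)
    ```

    Both vectorized versions sort the data once, whereas the sapply() version repeats that work for each of the 50,000 values.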