Search code examples
rdata.tablepercentileecdf

Data.table in R: Fitting a function on a different subgroup than the function's application group


In a large data.table, I would like to compute the percentile of a numeric variable with respect to a date variable, and with respect to a subgroup of a group variable.

Example: In the table below I compute each value's percentile with respect to date and all members of the group-variable:

library(data.table)

set.seed(101)
dt <- data.table(group = c(rep("A",3),rep("B",3),rep("C",3), rep("D",3)), 
date = rep(as.Date(19000:19002, origin = "1970-01-01"),4), 
value=rnorm(12))
 
dt[, Percentile := ecdf(value)(value), date]

> dt
    group       date      value Percentile
 1:     A 2022-01-08 -0.3260365       0.25
 2:     A 2022-01-09  0.5524619       1.00
 3:     A 2022-01-10 -0.6749438       0.50
 4:     B 2022-01-08  0.2143595       0.75
 5:     B 2022-01-09  0.3107692       0.50
 6:     B 2022-01-10  1.1739663       1.00
 7:     C 2022-01-08  0.6187899       1.00
 8:     C 2022-01-09 -0.1127343       0.25
 9:     C 2022-01-10  0.9170283       0.75
10:     D 2022-01-08 -0.2232594       0.50
11:     D 2022-01-09  0.5264481       0.75
12:     D 2022-01-10 -0.7948444       0.25

Is there a way to compute the percentile of each value with respect to date, but the ecdf()-function only being fitted to group members, e.g. c("A","B","C")?

The result should look like this:

> dt
   group       date      value Percentile
1:     A 2022-01-08 -0.3260365  0.3333333
2:     A 2022-01-09  0.5524619  1.0000000
3:     A 2022-01-10 -0.6749438  0.3333333
4:     B 2022-01-08  0.2143595  0.6666667
5:     B 2022-01-09  0.3107692  0.6666667
6:     B 2022-01-10  1.1739663  1.0000000
7:     C 2022-01-08  0.6187899  1.0000000
8:     C 2022-01-09 -0.1127343  0.3333333
9:     C 2022-01-10  0.9170283  0.6666667
10:    D 2022-01-08 -0.2232594  0.3333333
11:    D 2022-01-09  0.5264481  0.6666667
12:    D 2022-01-10 -0.7948444  0.0000000

Thank you!


Solution

  • We could use

    library(data.table)
    dt[, Percentile := ecdf(value[group %in% c("A", "B", "C")])(value), date]
    

    -output

    > dt
         group       date      value Percentile
        <char>     <Date>      <num>      <num>
     1:      A 2022-01-08 -0.3260365  0.3333333
     2:      A 2022-01-09  0.5524619  1.0000000
     3:      A 2022-01-10 -0.6749438  0.3333333
     4:      B 2022-01-08  0.2143595  0.6666667
     5:      B 2022-01-09  0.3107692  0.6666667
     6:      B 2022-01-10  1.1739663  1.0000000
     7:      C 2022-01-08  0.6187899  1.0000000
     8:      C 2022-01-09 -0.1127343  0.3333333
     9:      C 2022-01-10  0.9170283  0.6666667
    10:      D 2022-01-08 -0.2232594  0.3333333
    11:      D 2022-01-09  0.5264481  0.6666667
    12:      D 2022-01-10 -0.7948444  0.0000000