Search code examples
rdata.tablemeanquantilesubsampling

R Obtain quantile and mean from a tailored subset in the dataset


I would like to obtain quantile in a tailored subset. For example in the following dataset:

data = data.table(x=c(rep(1,9),rep(2,9)),y=c(rep(1:6,each=3)),z=1:18)
  1. For each row i, I want to know, in the rows with x=x[i] and y=<y[i], the 50%tile (as well as other quantiles in further calculations, e.g. 10%tile, 5%tile) of z[i].

An expected output would be

c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)
  1. For each row i, I want to know, in the rows with x=x[i] and y=<y[i], the mean of z[i].

An expected output would be (same as 1 in this dataset, but would be different in other datasets).

c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)

I can write a function for it and use apply to loop it over each row. However, the dataset has more than 30,000,000 rows which would take days. Is there a quicker way to calculate it, in R data.table or tidyverse or other packages?


Solution

  • Use a non-equi join in data.table

    data[data, quantile(z, 0.5), on = .(x = x, y <=y), by = .EACHI]$V1
    #[1]  2.0  2.0  2.0  3.5  3.5  3.5  5.0  5.0  5.0 11.0 11.0 11.0 12.5 12.5 12.5 14.0 14.0 14.0
    

    If we want to create a column

    data[data[unique(data[, .(x, y)]), quantile(z, 0.5), 
      on = .(x = x, y <=y), by = .EACHI], z_mean := V1, on = .(x, y)]
    

    -output

    > data
            x     y     z z_mean
        <num> <int> <int>  <num>
     1:     1     1     1    2.0
     2:     1     1     2    2.0
     3:     1     1     3    2.0
     4:     1     2     4    3.5
     5:     1     2     5    3.5
     6:     1     2     6    3.5
     7:     1     3     7    5.0
     8:     1     3     8    5.0
     9:     1     3     9    5.0
    10:     2     4    10   11.0
    11:     2     4    11   11.0
    12:     2     4    12   11.0
    13:     2     5    13   12.5
    14:     2     5    14   12.5
    15:     2     5    15   12.5
    16:     2     6    16   14.0
    17:     2     6    17   14.0
    18:     2     6    18   14.0