Search code examples
rcountdata.tabletidyversesubset

r Compare a number to a tailored subset in the dataset


I would like to count how many rows having a larger number in a tailored subset. For example in the following dataset:

data = data.table(x=c(rep(1,8),rep(2,8)),y=c(rep(1:8,each=2)),z=c(1,2,3,4,5,4,3,2,1,2,3,4,5,4,3,4))

For each row i, I want to know, in the rows with x=x[i] and y<y[i], how many of them have z greater than z[i].

An expected output would be

c(0,0,0,0,0,0,3,4,0,0,0,0,0,0,3,1)

I can write a function for it and use apply to loop it over each row. However, the dataset has more than 30,000,000 rows which would take days. Is there a quicker way to calculate it, in R data.table or tidyverse or other packages?


Solution

  • We can use a non-equi join for this

    library(data.table)
    data[data, .N, on = .(x = x, y < y, z > z), by = .EACHI]
            x     y     z     N
        <num> <int> <num> <int>
     1:     1     1     1     0
     2:     1     1     2     0
     3:     1     2     3     0
     4:     1     2     4     0
     5:     1     3     5     0
     6:     1     3     4     0
     7:     1     4     3     3
     8:     1     4     2     4
     9:     2     5     1     0
    10:     2     5     2     0
    11:     2     6     3     0
    12:     2     6     4     0
    13:     2     7     5     0
    14:     2     7     4     0
    15:     2     8     3     3
    16:     2     8     4     1