Search code examples
rdata.tablenon-equi-join

Filtering by different ranges by group in data.table


I have some data in a table and I would like to do a non-equi join (I think is the right terminology) and filter it by different ranges for different groups. In the below example I would like to filter group "a" so that it only returns values between 1 and 20 (inclusive) and group "b" so it only returns values between 80 and 100 (inclusive). My reading suggests that inrange should be the item to use. I get how to use it in the generic case, but I am not sure how to get it to run with different ranges by group. (example code adapted from ?inrange)

create sample data

set.seed(1234)
Y = data.table(a=sample(1:100,100), val=runif(100), group=c(rep("a",50),rep("b",50)))
range = data.table(group=c("a","b"),start = c(1,80), end = c(20,100))

Try to filter

Y[inrange(a, range$start, range$end),,by=group]

This obviously does not work and instead applies those ranges to the entire dataset and throws the error message Ignoring by= because j= is not supplied. I think it is clear to me that this is not working because I haven't created the 'join' between the range table and Y, but I am not seeing how to make the two tables communicate grouping via inrange.

Note: In reality values in a will be posixct datetimes, but for the sake of simplicity I am not using that here.


Solution

  • Perhaps:

    Y[range, K := TRUE, on = .(group, a >= start, a <= end)][!is.na(K),]
    #         a        val  group      K
    #     <int>      <num> <char> <lgcl>
    #  1:     9 0.60189755      a   TRUE
    #  2:     5 0.99874081      a   TRUE
    #  3:    16 0.55512663      a   TRUE
    #  4:     4 0.42944396      a   TRUE
    #  5:    14 0.43101637      a   TRUE
    #  6:     3 0.47880269      a   TRUE
    #  7:     2 0.02220682      a   TRUE
    #  8:     6 0.63891131      a   TRUE
    #  9:     8 0.83470266      a   TRUE
    # 10:    17 0.98304402      a   TRUE
    # 11:    98 0.76785547      b   TRUE
    # 12:    94 0.30766574      b   TRUE
    # 13:    88 0.25814665      b   TRUE
    # 14:    89 0.49954639      b   TRUE
    # 15:    83 0.50892062      b   TRUE
    # 16:    95 0.49443856      b   TRUE
    # 17:    97 0.56695890      b   TRUE
    # 18:    87 0.98970989      b   TRUE
    # 19:    82 0.53190509      b   TRUE
    # 20:   100 0.59662376      b   TRUE
    #         a        val  group      K
    

    There are other ways to do this, but they involve renaming or loss of information. For instance,

    • left-join range and Y, we lose a:

      Y[range, on = .(group, a >= start, a <= end)]
      #         a        val  group   a.1
      #     <int>      <num> <char> <int>
      #  1:     1 0.60189755      a    20
      #  2:     1 0.99874081      a    20
      #  3:     1 0.55512663      a    20
      # ...
      # 18:    80 0.98970989      b   100
      # 19:    80 0.53190509      b   100
      # 20:    80 0.59662376      b   100
      #         a        val  group   a.1
      

      The fix is to copy Y$a into a new variable and join on it instead:

      Y[,a1 := a][range, on = .(group, a1 >= start, a1 <= end)]
      #         a        val  group    a1  a1.1
      #     <int>      <num> <char> <int> <int>
      #  1:     9 0.60189755      a     1    20
      #  2:     5 0.99874081      a     1    20
      #  3:    16 0.55512663      a     1    20
      # ...
      # 18:    87 0.98970989      b    80   100
      # 19:    82 0.53190509      b    80   100
      # 20:   100 0.59662376      b    80   100
      #         a        val  group    a1  a1.1
      
    • left-join Y and range, we get a duplicated into start and end but no clear indicator to filter on:

      range[Y, on = .(group, start <= a, end >= a)]
      #       group start   end        val
      #      <char> <int> <int>      <num>
      #   1:      a    28    28 0.85026492
      #   2:      a    80    80 0.23466126
      #   3:      a    22    22 0.98816745
      # ...
      #  98:      b    82    82 0.53190509
      #  99:      b   100   100 0.59662376
      # 100:      b    30    30 0.26388647
      #       group start   end        val
      

      A remedy would be to copy in another field that would give us the indicator of merge that we need to be able to filter. But even with that we have to rename to regain a's data:

      range[, K := TRUE][Y, on = .(group, start <= a, end >= a)][ !is.na(K), ]
      #      group start   end      K        val
      #     <char> <int> <int> <lgcl>      <num>
      #  1:      a     9     9   TRUE 0.60189755
      #  2:      a     5     5   TRUE 0.99874081
      #  3:      a    16    16   TRUE 0.55512663
      # ...
      # 18:      b    87    87   TRUE 0.98970989
      # 19:      b    82    82   TRUE 0.53190509
      # 20:      b   100   100   TRUE 0.59662376
      #      group start   end      K        val