Search code examples
rdata.tablenon-equi-join

R: unequi join with merge function


I am working with data.table and I want to do a non-equi left join/merge.

I have one table with car prices and another table to identify which car class each car belongs to:

data_priceclass <- data.table()
data_priceclass$price_from <- c(0, 0, 200000, 250000, 300000, 350000, 425000, 500000, 600000, 700000, 800000, 900000, 1000000, 1100000, 1200000, 1300000, 1400000, 1500000, 1600000, 1700000, 1800000) 
data_priceclass$price_to <- c(199999, 199999, 249999, 299999, 349999, 424999, 499999, 599999, 699999, 799999, 899999, 999999, 1099999, 1199999, 1299999, 1399999, 1499999, 1599999, 1699999, 1799999, 1899999)
data_priceclass$price_class <- c(1:20, 99)

I use a non-equi join to merge the two tables. But the x[y]-join syntax of data.table removes duplicates.

cars <- data.table(car_price = c(190000, 500000))
cars[data_priceclass, on = c("car_price >= price_from", 
                             "car_price < price_to"),
     price_class := i.price_class,]
cars

Notice that the car with value 190000 is supposed to get matches on two rows in the data_priceclass table, but since x[y] removes duplicates, I can't see this in the output. Normally when I join I always use the merge function instead of x[y], because I'm losing control when I use x[y].

But the following does not work with non-equi joins:

merge(cars, data_priceclass,
      by = c("car_price >= price_from", 
             "car_price < price_to"),
      all.x = T , all.y = F)

Any tips how I can do a non-equi join with data.table that does not remove duplicates?


Solution

  • As noted in comments, a left join on cars is done by using cars as subsetting condition i in the DT[i,j,by] syntax.
    This puts cars on the right, which might be counter-intuitive compared to SQL, and I found this tutorial useful to compare both syntaxes.

    cars <- data.table(car_price = c(190000, 500000))
    data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]
    
       car_price x.price_from x.price_to price_class
    1:    190000        0e+00     199999           1
    2:    190000        0e+00     199999           2
    3:    500000        5e+05     599999           8
    
    

    If you increase car price:

    cars <- cars * 10
    data_priceclass[cars, .(car_price,x.price_from,x.price_to,price_class),on = .(price_from <= car_price,price_to > car_price)]
    
       car_price x.price_from x.price_to price_class
    1:   1900000           NA         NA          NA
    2:   5000000           NA         NA          NA