Search code examples
rdata.tablefuzzyjoin

R join two data.table with with exact on one column and fuzzy on second


I am working with two data.tables,

  1. predicted yields over age based on a variety of stand condition
  2. field measurements of yields at a particular field location, with a measured age

I would like to find the yield curve that best predicts the measured yield of the field data.

library (fuzzyjoin)
library(data.table)
library(ggplot2)

# set up some dummy yield curves
species <- c("a", "b") 
age <- seq(0:120)
s <- 10:12 # site difference
yield_db <- data.table(expand.grid(species=species, s=s, age=age))[
  order(species, age, s)]
yield_db[species=="a", yield := 1.5*age+age*s*3]
yield_db[species=="b", yield := 0.75*age+age*s*2]
yield_db[, yc_id := .GRP, by = .(species, s)] # add a unique identifier

# generate some measurements - just add some noise to some sample yields
set.seed(1)
num_rows <- 3  # Set the desired number of rows
measurement_db <- yield_db[age>20][sample(.N,num_rows)]
measurement_db[,yield:=yield+runif(num_rows, min=-40, max=40)]
measurement_db[,age:=age+round(runif(num_rows, min=-5, max=5),0)]

# Plot the "measurements" against "yields"
ggplot(data = yield_db, aes(x=age, y=yield, colour=as.factor(yc_id))) +
  geom_line() +
  geom_point(data=measurement_db, aes(x=age, y=yield), colour="orange")

# Join to nearest yield
res <- difference_left_join(
  measurement_db,
  yield_db,
  by=c("yield")
)

enter image description here

> res
  species.x s.x age.x  yield.x yc_id.x species.y s.y age.y yield.y yc_id.y
1         a  12    60 2375.364       3         b  12    96 2376.00       6
2         b  11    86 2035.079       5         a  11    59 2035.50       2
3         b  12    78 1845.943       6         b  10    89 1846.75       4
> 

What I would like to do is force the join to keep the age the same (i.e., age.x == age.y) and the same species (i.e., species.x == species.y), and find the closest matching yield curve.

Thank you


Solution

  • You could also use data.table's join/roll ability:

    library(data.table)
    setDT(yield_db)
    setDT(measurement_db)
    
    yield_db[measurement_db, on=c('species', 'age', 'yield'), roll='nearest']
    
       species     s   age    yield yc_id   i.s i.yc_id
        <fctr> <int> <int>    <num> <int> <int>   <int>
    1:       a    12    60 2375.364     3    12       3
    2:       b    11    86 2035.079     5    11       5
    3:       b    11    78 1845.943     5    12       6