I am working with two data.tables,
I would like to find the yield curve that best predicts the measured yield of the field data.
library (fuzzyjoin)
library(data.table)
library(ggplot2)
# set up some dummy yield curves
species <- c("a", "b")
age <- seq(0:120)
s <- 10:12 # site difference
yield_db <- data.table(expand.grid(species=species, s=s, age=age))[
order(species, age, s)]
yield_db[species=="a", yield := 1.5*age+age*s*3]
yield_db[species=="b", yield := 0.75*age+age*s*2]
yield_db[, yc_id := .GRP, by = .(species, s)] # add a unique identifier
# generate some measurements - just add some noise to some sample yields
set.seed(1)
num_rows <- 3 # Set the desired number of rows
measurement_db <- yield_db[age>20][sample(.N,num_rows)]
measurement_db[,yield:=yield+runif(num_rows, min=-40, max=40)]
measurement_db[,age:=age+round(runif(num_rows, min=-5, max=5),0)]
# Plot the "measurements" against "yields"
ggplot(data = yield_db, aes(x=age, y=yield, colour=as.factor(yc_id))) +
geom_line() +
geom_point(data=measurement_db, aes(x=age, y=yield), colour="orange")
# Join to nearest yield
res <- difference_left_join(
measurement_db,
yield_db,
by=c("yield")
)
> res
species.x s.x age.x yield.x yc_id.x species.y s.y age.y yield.y yc_id.y
1 a 12 60 2375.364 3 b 12 96 2376.00 6
2 b 11 86 2035.079 5 a 11 59 2035.50 2
3 b 12 78 1845.943 6 b 10 89 1846.75 4
>
What I would like to do is force the join to keep the age the same (i.e., age.x == age.y) and the same species (i.e., species.x == species.y), and find the closest matching yield curve.
Thank you
You could also use data.table's join/roll ability:
library(data.table)
setDT(yield_db)
setDT(measurement_db)
yield_db[measurement_db, on=c('species', 'age', 'yield'), roll='nearest']
species s age yield yc_id i.s i.yc_id
<fctr> <int> <int> <num> <int> <int> <int>
1: a 12 60 2375.364 3 12 3
2: b 11 86 2035.079 5 11 5
3: b 11 78 1845.943 5 12 6