I am working with thermal requirements for crop growth. I have a table which contains cumulative temperature for a 6-month time period. Sample seen below:
date temp cum_temp
1: 2020-03-01 9.339748 9.339748
2: 2020-03-02 23.860849 33.200597
3: 2020-03-03 12.860331 46.060928
4: 2020-03-04 26.607505 72.668432
5: 2020-03-05 28.273551 100.941984
6: 2020-03-06 2.321138 103.263122
7: 2020-03-07 16.315059 119.578181
8: 2020-03-08 26.880152 146.458334
9: 2020-03-09 16.991615 163.449949
10: 2020-03-10 14.241827 177.691776
11: 2020-03-11 28.748167 206.439943
12: 2020-03-12 14.146691 220.586634
13: 2020-03-13 20.649548 241.236182
14: 2020-03-14 17.606369 258.842551
15: 2020-03-15 3.984816 262.827367
Then, I also have a table with a list of crop growth stages and their thermal requirements (i.e. the thermal thresholds needed to reach each stage):
growth_stage thermal_req
1: VE 120
2: V2 200
3: V3 350
4: V5-V6 475
5: V7-V9 610
6: R2 1660
7: R4 1925
8: R5 2450
9: R6 2700
Based on those tables, I need two outcomes:
growth_stage thermal_req date_reached
1: VE 120 2020-03-08
2: V2 200 2020-03-11
3: V3 350 2020-03-21
4: V5-V6 475 2020-03-26
5: V7-V9 610 2020-04-03
6: R2 1660 2020-06-14
7: R4 1925 2020-06-30
8: R5 2450 2020-08-06
9: R6 2700 2020-08-23
date temp cum_temp growth_stage
1: 2020-03-01 9.339748 9.339748 NA
2: 2020-03-02 23.860849 33.200597 NA
3: 2020-03-03 12.860331 46.060928 NA
4: 2020-03-04 26.607505 72.668432 NA
5: 2020-03-05 28.273551 100.941984 NA
6: 2020-03-06 2.321138 103.263122 NA
7: 2020-03-07 16.315059 119.578181 NA
8: 2020-03-08 26.880152 146.458334 VE
9: 2020-03-09 16.991615 163.449949 VE
10: 2020-03-10 14.241827 177.691776 VE
11: 2020-03-11 28.748167 206.439943 V2
12: 2020-03-12 14.146691 220.586634 V2
13: 2020-03-13 20.649548 241.236182 V2
14: 2020-03-14 17.606369 258.842551 V2
15: 2020-03-15 3.984816 262.827367 V2
16: 2020-03-16 27.094924 289.922291 V2
17: 2020-03-17 8.136544 298.058835 V2
18: 2020-03-18 2.219726 300.278562 V2
19: 2020-03-19 10.509701 310.788263 V2
20: 2020-03-20 28.680606 339.468868 V2
21: 2020-03-21 26.796640 366.265509 V3
22: 2020-03-22 21.091299 387.356807 V3
23: 2020-03-23 19.574698 406.931505 V3
24: 2020-03-24 29.833824 436.765328 V3
25: 2020-03-25 20.015468 456.780797 V3
26: 2020-03-26 21.547384 478.328180 V5-V6
27: 2020-03-27 16.777915 495.106095 V5-V6
28: 2020-03-28 18.230119 513.336214 V5-V6
29: 2020-03-29 9.385632 522.721846 V5-V6
30: 2020-03-30 5.266296 527.988142 V5-V6
31: 2020-03-31 28.927703 556.915844 V5-V6
32: 2020-04-01 27.166672 584.082517 V5-V6
33: 2020-04-02 21.030453 605.112970 V5-V6
34: 2020-04-03 24.068555 629.181525 V5-V6
35: 2020-04-04 1.713797 630.895322 V5-V6
36: 2020-04-05 14.856083 645.751405 V5-V6
37: 2020-04-06 22.995327 668.746732 V5-V6
38: 2020-04-07 7.275830 676.022562 V5-V6
39: 2020-04-08 10.227249 686.249811 V5-V6
40: 2020-04-09 7.717148 693.966959 V5-V6
date temp cum_temp growth_stage
What is the best way to achieve these outcomes?
Data used to reproduce this problem:
# load required packages
library(data.table)
# generate data
dates <- seq(as.Date("2020-03-01"), as.Date("2020-08-31"), by="days")
set.seed(123); temps <- runif(length(dates), min=1, max=30)
dat <- data.table(date=dates,
temp=temps)
# cumulative sum
dat$cum_temp <- cumsum(dat$temp)
# table with growth stage thermal requirements
sum_req <- data.table(growth_stage=c("VE","V2","V3","V5-V6","V7-V9","R2","R4","R5","R6"),
thermal_req=c(120,200,350,475,610,1660,1925,2450,2700))
Up front, all working code in one block, no output:
dat[, nextdate := shift(date, type="lead", fill=last(date))]
sum_req[dat, date_reached := i.nextdate, on=.(thermal_req > cum_temp)]
dat[, nextdate := NULL] # cleanup
dat[sum_req, growth_stage := i.growth_stage, on = .(date == date_reached), roll = TRUE
][, growth_stage := zoo::na.locf(growth_stage, na.rm = FALSE)]
First, I'll add a nextdate
to dat
so that the inherent internal (rolling?) join finds the correct date. Then we'll join and assign this to sum_req
:
dat[, nextdate := shift(date, type="lead", fill=last(date))]
sum_req[dat, date_reached := i.nextdate, on=.(thermal_req > cum_temp)]
# growth_stage thermal_req date_reached
# <char> <num> <Date>
# 1: VE 120 2020-03-08
# 2: V2 200 2020-03-11
# 3: V3 350 2020-03-21
# 4: V5-V6 475 2020-03-26
# 5: V7-V9 610 2020-04-03
# 6: R2 1660 2020-06-14
# 7: R4 1925 2020-06-30
# 8: R5 2450 2020-08-06
# 9: R6 2700 2020-08-23
dat[, nextdate := NULL] # cleanup
Then I'll do a rolling join,
dat[sum_req, growth_stage := i.growth_stage, on = .(date == date_reached), roll = TRUE]
# date temp cum_temp growth_stage
# <Date> <num> <num> <char>
# 1: 2020-03-01 9.33975 9.33975 <NA>
# 2: 2020-03-02 23.86085 33.20060 <NA>
# 3: 2020-03-03 12.86033 46.06093 <NA>
# 4: 2020-03-04 26.60750 72.66843 <NA>
# 5: 2020-03-05 28.27355 100.94198 <NA>
# 6: 2020-03-06 2.32114 103.26312 <NA>
# 7: 2020-03-07 16.31506 119.57818 <NA>
# 8: 2020-03-08 26.88015 146.45833 VE
# 9: 2020-03-09 16.99162 163.44995 <NA>
# 10: 2020-03-10 14.24183 177.69178 <NA>
# ---
# 175: 2020-08-22 20.38025 2699.72879 <NA>
# 176: 2020-08-23 18.92252 2718.65131 R6
# 177: 2020-08-24 11.79490 2730.44621 <NA>
# 178: 2020-08-25 16.36523 2746.81145 <NA>
# 179: 2020-08-26 26.36579 2773.17724 <NA>
# 180: 2020-08-27 17.87075 2791.04799 <NA>
# 181: 2020-08-28 25.35327 2816.40126 <NA>
# 182: 2020-08-29 10.06100 2826.46225 <NA>
# 183: 2020-08-30 21.54042 2848.00267 <NA>
# 184: 2020-08-31 8.68552 2856.68819 <NA>
and then LOCF the new growth_stage
. Here I'm using zoo::na.locf
since data.table::nafill
requires numeric/integer.
dat[, growth_stage := zoo::na.locf(growth_stage, na.rm = FALSE)]
# date temp cum_temp growth_stage
# <Date> <num> <num> <char>
# 1: 2020-03-01 9.33975 9.33975 <NA>
# 2: 2020-03-02 23.86085 33.20060 <NA>
# 3: 2020-03-03 12.86033 46.06093 <NA>
# 4: 2020-03-04 26.60750 72.66843 <NA>
# 5: 2020-03-05 28.27355 100.94198 <NA>
# 6: 2020-03-06 2.32114 103.26312 <NA>
# 7: 2020-03-07 16.31506 119.57818 <NA>
# 8: 2020-03-08 26.88015 146.45833 VE
# 9: 2020-03-09 16.99162 163.44995 VE
# 10: 2020-03-10 14.24183 177.69178 VE
# ---
# 175: 2020-08-22 20.38025 2699.72879 R5
# 176: 2020-08-23 18.92252 2718.65131 R6
# 177: 2020-08-24 11.79490 2730.44621 R6
# 178: 2020-08-25 16.36523 2746.81145 R6
# 179: 2020-08-26 26.36579 2773.17724 R6
# 180: 2020-08-27 17.87075 2791.04799 R6
# 181: 2020-08-28 25.35327 2816.40126 R6
# 182: 2020-08-29 10.06100 2826.46225 R6
# 183: 2020-08-30 21.54042 2848.00267 R6
# 184: 2020-08-31 8.68552 2856.68819 R6
For easy viewing, I'll show the change-points:
dat[, .SD[1,], by = .(growth_stage)]
# growth_stage date temp cum_temp
# <char> <Date> <num> <num>
# 1: <NA> 2020-03-01 9.33975 9.33975
# 2: VE 2020-03-08 26.88015 146.45833
# 3: V2 2020-03-11 28.74817 206.43994
# 4: V3 2020-03-21 26.79664 366.26551
# 5: V5-V6 2020-03-26 21.54738 478.32818
# 6: V7-V9 2020-04-03 24.06856 629.18152
# 7: R2 2020-06-14 26.82016 1660.54649
# 8: R4 2020-06-30 10.27480 1927.39382
# 9: R5 2020-08-06 13.89827 2459.84312
# 10: R6 2020-08-23 18.92252 2718.65131
(I think your sample output missed the transition to V7-V9
?)