So I have 2 dataframes, one (dfA) with climate data that has over 100,000 rows, and one (dfB) as a list of sampling criteria (seed zone, altitude, and freq) with only about 50 rows.
What I'm aiming to do is use dfB as a basis to randomly sample rows from dfA based on shared columns (seed zone and altitude), picking n rows based on the freq column. This has been successful! The first code below is what I have managed to make work as far as I can tell, but it gives very limited results as the values have to match exactly and very few do.
Ideally, what I'd like to do, is have the sampling happen as above, but across a range of altitudes based on the dfB$Altitude value (say +-10)
Code for basic sampling (works):
sample <- dfA %>%
inner_join(dfB) %>%
group_by(Seed_zone, Altitude) %>%
sample_n(first(freq))
Code I've tried (in multiple variations) that hasn't worked
sample <- dfA %>%
filter(dfA > dfB -10 & dfA < dfB +10) %>%
left_join(dfB, by = c("Seed_zone")) %>%
group_by(Seed_zone) %>%
sample_n(first(freq))
dfA example data
Climate data | Altitude | Seed_zone |
---|---|---|
5.107707 | 305 | 101 |
4.845323 | 499 | 203 |
2.985152 | 7 | 101 |
22.17268 | 299 | 101 |
dfB example data
Altitude | Seed_zone | freq |
---|---|---|
300 | 101 | 2 |
502 | 203 | 1 |
Expected outcome (using +- 10)
Climate data | Altitude | Seed_zone | freq |
---|---|---|---|
5.107707 | 305 | 101 | 2 |
4.845323 | 499 | 203 | 1 |
22.17268 | 299 | 101 | 2 |
We can use join_by
. Note that when joining on an inequality, we cannot calculate on the fly, so we pre-calc the +/-0 10.
library(dplyr)
dfB %>%
mutate(m10 = Altitude-10, p10 = Altitude+10) %>%
inner_join(dfA, join_by(Seed_zone, m10 <= Altitude, p10 >= Altitude), suffix = c(".B", "")) %>%
group_by(Seed_zone, Altitude.B) %>%
sample_n(first(freq)) %>%
ungroup()
# # A tibble: 3 × 7
# Altitude.B Seed_zone freq m10 p10 `Climate data` Altitude
# <int> <int> <int> <dbl> <dbl> <dbl> <int>
# 1 300 101 2 290 310 5.11 305
# 2 300 101 2 290 310 22.2 299
# 3 502 203 1 492 512 4.85 499
In this case, Altitude.B
is from dfB
(ergo why I intentionally change the suffix) and Altitude
is the original from dfA
(I thought it easier to keep this unchanged .. perhaps I should have used suffix = c(".B", ".A")
).