Search code examples
rdataframerangeinner-joinsample

R: randomly sampling a dataframe based on another dataframe with a range


So I have 2 dataframes, one (dfA) with climate data that has over 100,000 rows, and one (dfB) as a list of sampling criteria (seed zone, altitude, and freq) with only about 50 rows.

What I'm aiming to do is use dfB as a basis to randomly sample rows from dfA based on shared columns (seed zone and altitude), picking n rows based on the freq column. This has been successful! The first code below is what I have managed to make work as far as I can tell, but it gives very limited results as the values have to match exactly and very few do.

Ideally, what I'd like to do, is have the sampling happen as above, but across a range of altitudes based on the dfB$Altitude value (say +-10)

Code for basic sampling (works):

sample <- dfA %>%
  inner_join(dfB) %>% 
  group_by(Seed_zone, Altitude) %>%
  sample_n(first(freq))

Code I've tried (in multiple variations) that hasn't worked

sample <- dfA %>%
  filter(dfA > dfB -10 & dfA < dfB +10) %>%
  left_join(dfB, by = c("Seed_zone")) %>%
  group_by(Seed_zone) %>%
  sample_n(first(freq))

dfA example data

Climate data Altitude Seed_zone
5.107707 305 101
4.845323 499 203
2.985152 7 101
22.17268 299 101

dfB example data

Altitude Seed_zone freq
300 101 2
502 203 1

Expected outcome (using +- 10)

Climate data Altitude Seed_zone freq
5.107707 305 101 2
4.845323 499 203 1
22.17268 299 101 2

Solution

  • We can use join_by. Note that when joining on an inequality, we cannot calculate on the fly, so we pre-calc the +/-0 10.

    library(dplyr)
    dfB %>%
      mutate(m10 = Altitude-10, p10 = Altitude+10) %>%
      inner_join(dfA, join_by(Seed_zone, m10 <= Altitude, p10 >= Altitude), suffix = c(".B", "")) %>%
      group_by(Seed_zone, Altitude.B) %>%
      sample_n(first(freq)) %>%
      ungroup()
    # # A tibble: 3 × 7
    #   Altitude.B Seed_zone  freq   m10   p10 `Climate data` Altitude
    #        <int>     <int> <int> <dbl> <dbl>          <dbl>    <int>
    # 1        300       101     2   290   310           5.11      305
    # 2        300       101     2   290   310          22.2       299
    # 3        502       203     1   492   512           4.85      499
    

    In this case, Altitude.B is from dfB (ergo why I intentionally change the suffix) and Altitude is the original from dfA (I thought it easier to keep this unchanged .. perhaps I should have used suffix = c(".B", ".A")).