Search code examples
rdata.tabletidyversedata-cleaning

Subset a data frame based on year column with the closest value below and farthest value above another year column


Subset dataframe based on PipeYear with the closest value below PropertyYearBuilt and farthest value above the PropertyYearBuilt using the following R code:

df <- read.table(text="
PipeID  PricePipe   PipeYear PropertyYearBuilt  Distance_to_property
    a       500         2010     2013               1.5
    b       600         2007     2008               2.5
    c       700         2009     2008               3.0
    d       800         1998     2000               4.2
    e       900         2003     2000               4.5
    f       200         2014     2013               5.0
    g       100         2011     2013               5.5
    h       850         2018     2008               7.0", header = TRUE)

Thanks!


Solution

  • Similar answer to one I posted here (if you have latest dplyr updates), but this time the furthest above is just the max grouped by PropertyID:

    library(tidyverse)
    
    df <- read.table(text="
    PipeID  PricePipe   PipeYear PropertyYearBuilt  Distance_to_property
        a       500         2010     2013               1.5
        b       600         2007     2008               2.5
        c       700         2009     2008               3.0
        d       800         1998     2000               4.2
        e       900         2003     2000               4.5
        f       200         2014     2013               5.0
        g       100         2011     2013               5.5
        h       850         2018     2008               7.0", header = TRUE) |> 
      mutate(PropertyID = as.numeric(as.factor(PropertyYearBuilt)))
    
    bind_rows(
      df |>
        select(PropertyYearBuilt, PropertyID) |>
        unique() |>
        left_join(
          df |> select(-PropertyYearBuilt),
          join_by(PropertyID, closest(PropertyYearBuilt >= PipeYear))
        ),
    df |> 
      group_by(PropertyYearBuilt) |> 
      filter(PipeYear == max(PipeYear))
    ) |> 
      arrange(PropertyID, PipeYear)
    #>   PropertyYearBuilt PropertyID PipeID PricePipe PipeYear Distance_to_property
    #> 1              2000          1      d       800     1998                  4.2
    #> 2              2000          1      e       900     2003                  4.5
    #> 3              2008          2      b       600     2007                  2.5
    #> 4              2008          2      h       850     2018                  7.0
    #> 5              2013          3      g       100     2011                  5.5
    #> 6              2013          3      f       200     2014                  5.0