Search code examples
rdatemultiple-columnsmatching

How to identify date within a specific range from multiple columns of a dataframe in R?


I have a very large dataframe with > 300 date columns (and >100,000 rows). Each row also contains an 'index date'. For each row, I want to identify whether any of the subsequent date columns contain a date which is within 6 months of the index date. I then want an output of a separate column which tells me WHICH of the date columns for that row contained the date which was within 6 months of the index date. I'm afraid I haven't really tried anything meaningful as am not sure where to start but I've provided a simplified example of what I am trying to achieve below. I would be very grateful for any ideas please!

Example dataframe

library(dplyr)

example <- data.frame(
'index_date' = as.Date(c('2006/01/01', '2007/01/01', '2008/01/01', '2009/01/01', '2010/01/01')),
'date_1' = as.Date(c('2006/02/01', '2004/01/01', '2004/01/01', '2004/01/01', '2004/01/01')),
'date_2' = as.Date(c('2011/02/01', '2006/12/01', '2005/01/01', '2005/01/01', '2005/01/01')),
'date_3' = as.Date(c('2012/02/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/03/01')),
'date_4' = as.Date(c('2013/03/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/06/01'))) %>% 

# now create the `date_range` columns for each row. `start_range` is 6 months prior to the index date (182.6 days) and `end_range` is 6 months (182.6 days) after the index date

 mutate(start_range = (index_date - 182.6),
         end_range = (index_date + 182.6))

This produces the following dataframe

  index_date     date_1     date_2     date_3     date_4 start_range  end_range
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2006-03-01  2005-07-02 2006-07-02
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01  2006-07-02 2007-07-02
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2007-07-02 2008-07-01
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2008-07-02 2009-07-02
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01  2009-07-02 2010-07-02

Now what I would like to do is create a further column called matching_date which, for each row, tells me which of the date columns are within the date range. Note that row 5 contains two potential dates within range (date_3 and date_4), so ideally I also need to incorporate a function which identifies the date which is closest to the index date (in this case it would be date_3).

Desired output

  index_date     date_1     date_2     date_3     date_4 start_range  end_range matching_date
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2006-03-01  2005-07-02 2006-07-02        date_1
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01  2006-07-02 2007-07-02        date_2
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2007-07-02 2008-07-01            NA
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2008-07-02 2009-07-02            NA
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01  2009-07-02 2010-07-02        date_3

Solution

  • Assuming that your records are uniquely identified by index_date, you may go with this:

    library(tidyverse)
    example <- data.frame(
    'index_date' = as.Date(c('2006/01/01', '2007/01/01', '2008/01/01', '2009/01/01', '2010/01/01')),
    'date_1' = as.Date(c('2006/02/01', '2004/01/01', '2004/01/01', '2004/01/01', '2004/01/01')),
    'date_2' = as.Date(c('2011/02/01', '2006/12/01', '2005/01/01', '2005/01/01', '2005/01/01')),
    'date_3' = as.Date(c('2012/02/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/03/01')),
    'date_4' = as.Date(c('2013/03/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/06/01'))) %>% 
    
    # now create the `date_range` columns for each row. `start_range` is 6 months prior to the index date (182.6 days) and `end_range` is 6 months (182.6 days) after the index date
    
     mutate(start_range = (index_date - 182.6),
             end_range = (index_date + 182.6))
    
    example_long <- 
      example |> 
      pivot_longer(
        cols=starts_with("date"),
        names_to="vars",
        values_to="dates") |> 
      mutate(diff = abs(dates - index_date)) |> 
      rowwise() |> 
      mutate(matching_date = ifelse(between(dates, start_range, end_range), vars, NA)) |> 
      filter(!is.na(matching_date)) |> 
      group_by(index_date) |> 
      mutate(tie = which(diff == min(diff))) |> 
      filter(tie == row_number()) |> 
      select(index_date, matching_date)
    
    left_join(example, example_long)
    #> Joining, by = "index_date"
    #>   index_date     date_1     date_2     date_3     date_4 start_range  end_range
    #> 1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2013-03-01  2005-07-02 2006-07-02
    #> 2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01  2006-07-02 2007-07-02
    #> 3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2007-07-02 2008-07-01
    #> 4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2008-07-02 2009-07-02
    #> 5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01  2009-07-02 2010-07-02
    #>   matching_date
    #> 1        date_1
    #> 2        date_2
    #> 3          <NA>
    #> 4          <NA>
    #> 5        date_3
    

    Created on 2023-01-05 with reprex v2.0.2