Search code examples
rdataframedateapplyplyr

Taking minimum difference of dates between two data frame for the same id


my problem is simple. I have 2 data frame, each of them has a a column of dates (%Y-%m-%d), and a column for the ids. One has just one id per row, the other one has multiple rows for the same id. I want to take the value so that it presents the minumum difference for the dates. Now I explain better with an example:

df1 (single values for colA):

+-------+------------+------+------+-------+-------+
| colA  |    colB    | colC | colD | colE  | colF  |
+-------+------------+------+------+-------+-------+
| 3000  | 2011-01-20 |    2 | 3.43 | 2.01  | 1.63  |
| 3001  | 2012-04-06 |    1 | 1.12 | -0.63 | -1.16 |
| 3002  | 2012-04-24 |    2 | 2.28 | -0.18 | -0.12 |
| 3003  | 2012-04-13 |    2 | 1.27 | -0.51 | -0.82 |
| 3004  | 2011-08-24 |    5 | 5.30 | 2.68  | 2.10  |
| 3006  | 2011-08-02 |    2 | 2.12 | -0.27 | -2.60 |
+-------+------------+------+------+-------+-------+

df2 (multiple values for first column (X)):

+------+---------------+----------+
| colX |     colY      | colZ     |
+------+---------------+----------+
| 3000 | 2011-02-01    |        0 |
| 3000 | 2012-03-01    |        0 |
| 3000 | 2013-02-01    |        0 |
| 3000 | 2014-03-01    |        1 |
| 3000 | 2015-03-01    |        0 |
| 3000 | 2016-04-01    |        0 |
| 3002 | 2011-03-01    |        1 |
| 3002 | 2011-08-01    |        1 |
| 3002 | 2012-04-01    |        0 |
+------+---------------+----------+

In this case I see the first value in colA (df1) and compute all the differences in months between 2011-01-20 with all of the dates for 3000 in df2 (2011-02-01, 2012-03-01,ecc), so the first 6 rows. I take only the minimum difference, so in this case is the first one (2011-02-01) that is almost one month. So at the end I should have df1 with 3 new columns (Y and Z and diff) so the minimum date on df2, 0/1 of Z and the difference of the 2 date in days.

e.g. for 3000 (for the difference I take the abs):

3000  2011-01-20  2  3.43  2.01  1.63  2011-02-01 0 12

What function should I use? apply? ddply?

Thanks in advance


Solution

  • You can try this (just be careful of how you define the date operations as this is not clear in your question):

    library(tidyverse)
    library(lubridate)
    
    #Data
    df1 <- structure(list(colA = c(3000L, 3001L, 3002L, 3003L, 3004L, 3006L
    ), colB = c("2011-01-20", "2012-04-06", "2012-04-24", "2012-04-13", 
    "2011-08-24", "2011-08-02"), colC = c(2L, 1L, 2L, 2L, 5L, 2L), 
        colD = c(3.43, 1.12, 2.28, 1.27, 5.3, 2.12), colE = c(2.01, 
        -0.63, -0.18, -0.51, 2.68, -0.27), colF = c(1.63, -1.16, 
        -0.12, -0.82, 2.1, -2.6)), class = "data.frame", row.names = c(NA, 
    -6L))
    df2 <- structure(list(colX = c(3000L, 3000L, 3000L, 3000L, 3000L, 3000L, 
    3002L, 3002L, 3002L), colY = c("2011-02-01", "2012-03-01", "2013-02-01", 
    "2014-03-01", "2015-03-01", "2016-04-01", "2011-03-01", "2011-08-01", 
    "2012-04-01"), colZ = c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, 
    -9L))
    
    #Code
    #Compute
    dfo <- df2 %>% rename(colA=colX) %>% left_join(df1) %>% 
      mutate(Diff=abs(12*(year(as.Date(colB))-year(as.Date(colY)))+month(as.Date(colB))-month(as.Date(colY))),
             Diffdays=abs(as.Date(colB)-as.Date(colY))) %>% group_by(colA) %>%
      filter(Diff==min(Diff))
    #Format
    vars <- c(names(df1),names(df2)[-1],'Diff','Diffdays')
    #Data
    dfo %>% select(vars)
    
    # A tibble: 2 x 10
    # Groups:   colA [2]
       colA colB        colC  colD  colE  colF colY        colZ  Diff Diffdays
      <int> <chr>      <int> <dbl> <dbl> <dbl> <chr>      <int> <dbl> <drtn>  
    1  3000 2011-01-20     2  3.43  2.01  1.63 2011-02-01     0     1 12 days 
    2  3002 2012-04-24     2  2.28 -0.18 -0.12 2012-04-01     0     0 23 days 
    

    Please check if this fits to what you want.