Search code examples
sqlrdateimpaladbplyr

How to calculate difference in month in dbplyr (R and sql)


I tried to calculate the difference between dates in months for a huge database, using dbplyr package which sends my codes to the sql query. I tried different ways suggested in similar questions, but they didn't work (for example the codes below). Data2 is an object created by tbl()

Data2 %>% mutate(age_Months = (lubridate::interval(date_1, date_2)) / base::months(1))

#OR

library(lubridate)
Data2 %>% mutate(age_Months = TIMESTAMPDIFF(MONTH, date_1, date_2))

Do you have any ideas to calculate the differences? Feel free to suggest any other libraries and functions, it's not matter. I just need the differences in month. Thank you

The data set is as below.

Data2 %>% select(date_1, date_2) %>% head() %>% collect()

#  date_1                date_2               
#  <chr>                 <chr>                
#1 2015-05-01 00:00:00.0 2036-02-01 00:00:00.0
#2 2015-05-01 00:00:00.0 2036-01-01 00:00:00.0
#3 2015-05-01 00:00:00.0 2031-03-01 00:00:00.0
#4 2015-05-01 00:00:00.0 2035-12-01 00:00:00.0
#5 2015-05-01 00:00:00.0 2035-05-01 00:00:00.0
#6 2015-05-01 00:00:00.0 2032-03-01 00:00:00.0

I transformed them by as.Date() but no difference happened in result.

Data2 %>% mutate(date_1 = as.Date(date_1),
                 date_2 = as.Date(date_2))
#  date_1     date_2    
#  <date>     <date>    
#1 2018-08-01 2036-02-01
#2 2018-08-01 2036-06-01
#3 2018-08-01 2036-01-01
#4 2018-08-01 2034-08-01
#5 2018-08-01 2033-08-01
#6 2018-08-01 2035-03-01

Solution

  • Here are two possible approaches.

    1) Using lubridate

    dbplyr translations are defined for some (but not all) lubridate functions into some (but not all) SQL flavors (when last I tested). So I would start by trying the following:

    Data2 %>%
      mutate(year1 = lubridate::year(date_1),
             year2 = lubridate::year(date_2),
             month1 = lubridate::month(date_1),
             month2 = lubridate::month(date_2)) %>%
      mutate(age_months = 12*(year2 - year1) + month2 - month1) %>%
      select(-year1, -year2, -month1, -month2)
    
    2) Using untranslated SQL functions

    Impala includes a DATEDIFF function (documentation here). When dbplyr encounters a comand that it does not have a translation defined for, it passes the command untranslated through to the database.

    So something like the following:

    Data2 %>%
      mutate(age_months = DATEDIFF(date_2, date_1) / 30)
    

    should be translated into SQL like this:

    SELECT *
        ,DATEDIFF(date_2, date_1) / 30 AS age_months
    FROM data2