Search code examples
rdplyrdbplyr

Dbplyr count unique (n_distinct) with case_when or if_else


I want to create a new column counting unique dates (dplyr::n_distinct) with dplyr::case_when function using a dplyr::tbl df object.

I tried the following code:

df %>%
group_by(id) %>%
mutate(
    last_date = max(date),
    distinct_date_2020 = case_when(
            date >= "2020-01-01" & date <= "2020-12-31" ~ n_distinct(date[!type == "Online"]),
            TRUE ~ as.integer(0))
) %>%
ungroup() 

The mutate of distinct_date_2020 does not work. I receive the following error message:

[Microsoft][ODBC SQL Server Driver][SQL Server]Use of DISTINCT is not allowed with the OVER clause.

If I run without the distinct_date_2020, it works:

df %>%
group_by(id) %>%
mutate(
    last_date = max(date)
) %>%
ungroup() 

I also tried these variants, without success:

distinct_date_2020 = n_distinct(date[date >= "2020-01-01" & date <= "2020-12-31" & !type == "Online"])

and

distinct_date_2020 = ifelse(date >= "2020-01-01" & date <= "2020-12-31", n_distinct(date[!type == "Online"]), NA)

If I run these after collect(), it works. But I want to send this as a SQL command to the server.

DBI::dbGetInfo:

$dbms.name [1] "Microsoft SQL Server"

$db.version [1] "11.00.6523"

Anyone know how to solve this? Thanks!


Solution

  • I finally got it working. I first grouped by id, extracted the last date and then grouped by year (using DATEADD and DATEDIFF, avoiding the lubridate package because of its lack of compatibility with the dbplyr package). Summarizing and putting the condition [type != "Online"] between brackets also worked. I got the desired results with the code below:

    df %>%
        group_by(id) %>%
            mutate(
                last_date_temp = max(date)
            ) %>%
        group_by(year = DATEADD(sql("year"), DATEDIFF(sql("year"), 0, date), 0), id) %>%
        summarize(
            last_date = max(last_date_temp, na.rm = TRUE),
            distinct_date = n_distinct(date[type != "Online"]),
        ) %>%
        ungroup() %>%
        select(id, year, everything()) %>%
        arrange(id, year) %>%
        collect()