I want to create a new column counting unique dates (dplyr::n_distinct
) with dplyr::case_when
function using a dplyr::tbl
df object.
I tried the following code:
df %>%
group_by(id) %>%
mutate(
last_date = max(date),
distinct_date_2020 = case_when(
date >= "2020-01-01" & date <= "2020-12-31" ~ n_distinct(date[!type == "Online"]),
TRUE ~ as.integer(0))
) %>%
ungroup()
The mutate
of distinct_date_2020
does not work. I receive the following error message:
[Microsoft][ODBC SQL Server Driver][SQL Server]Use of DISTINCT is not allowed with the OVER clause.
If I run without the distinct_date_2020
, it works:
df %>%
group_by(id) %>%
mutate(
last_date = max(date)
) %>%
ungroup()
I also tried these variants, without success:
distinct_date_2020 = n_distinct(date[date >= "2020-01-01" & date <= "2020-12-31" & !type == "Online"])
and
distinct_date_2020 = ifelse(date >= "2020-01-01" & date <= "2020-12-31", n_distinct(date[!type == "Online"]), NA)
If I run these after collect()
, it works. But I want to send this as a SQL command to the server.
DBI::dbGetInfo
:
$dbms.name [1] "Microsoft SQL Server"
$db.version [1] "11.00.6523"
Anyone know how to solve this? Thanks!
I finally got it working. I first grouped by id
, extracted the last date and then grouped by year (using DATEADD
and DATEDIFF
, avoiding the lubridate
package because of its lack of compatibility with the dbplyr
package). Summarizing and putting the condition [type != "Online"]
between brackets also worked. I got the desired results with the code below:
df %>%
group_by(id) %>%
mutate(
last_date_temp = max(date)
) %>%
group_by(year = DATEADD(sql("year"), DATEDIFF(sql("year"), 0, date), 0), id) %>%
summarize(
last_date = max(last_date_temp, na.rm = TRUE),
distinct_date = n_distinct(date[type != "Online"]),
) %>%
ungroup() %>%
select(id, year, everything()) %>%
arrange(id, year) %>%
collect()