I am trying to replicate this SO question, but with the updated syntax that uses the across() function and moves away from the deprecated summarise_all() and funs().
I have a database extract that has one row per event type, like so:
library(tidyverse)
library(zoo)
df_start <- tibble(shipment = c(rep("A",4), rep("B",4)),
                   stop = rep(c(1,1,2,2), 2),
                   arrive_pickup = as.POSIXct(c("2021-01-01 07:00:00 UTC",NA, NA, NA,"2021-06-05 12:10:00 UTC", NA, NA, NA)),
                   depart_pickup = as.POSIXct(c(NA,"2021-01-01 08:40:00 UTC", NA, NA, NA, "2021-06-05 16:58:00 UTC", NA, NA)),
                   arrive_delivery = as.POSIXct(c(NA, NA, "2021-01-05 10:00:00 UTC",NA, NA, NA,"2021-06-08 10:58:00 UTC", NA)),
                   depart_delivery = as.POSIXct(c(NA, NA, NA, "2021-01-05 11:30:00 UTC",NA, NA, NA,"2021-06-08 13:50:00 UTC"))
)
> df_start
# A tibble: 8 x 6
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 1 2021-01-01 07:00:00 NA NA NA
2 A 1 NA 2021-01-01 08:40:00 NA NA
3 A 2 NA NA 2021-01-05 10:00:00 NA
4 A 2 NA NA NA 2021-01-05 11:30:00
5 B 1 2021-06-05 12:10:00 NA NA NA
6 B 1 NA 2021-06-05 16:58:00 NA NA
7 B 2 NA NA 2021-06-08 10:58:00 NA
8 B 2 NA NA NA 2021-06-08 13:50:00
... and I want to collapse the number of rows by grouping either by shipment and stop, or just by shipment. (I'm not sure whether leaving NA values in the final data frame will affect the answer, but I'd like to be able to solve it either way.)
# A tibble: 4 x 6
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA NA
2 A 2 NA NA 2021-01-05 10:00:00 2021-01-05 11:30:00
3 B 1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA NA
4 B 2 NA NA 2021-06-08 10:58:00 2021-06-08 13:50:00
# A tibble: 2 x 5
shipment arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dttm> <dttm> <dttm> <dttm>
1 A 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00
Based on this SO question, the following does work:
df_1 <- df_start %>%
  group_by(shipment, stop) %>% # Two groupings
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number()==n())
> df_1
# A tibble: 4 x 6
# Groups: shipment, stop [4]
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA NA
2 A 2 NA NA 2021-01-05 10:00:00 2021-01-05 11:30:00
3 B 1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA NA
4 B 2 NA NA 2021-06-08 10:58:00 2021-06-08 13:50:00
df_2 <- df_start %>%
  group_by(shipment) %>% # Single grouping
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number()==n())
> df_2
# A tibble: 2 x 6
# Groups: shipment [2]
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 2 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B 2 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00
However, summarise_all() and funs() are deprecated and should not be used going forward, so I am trying to understand how to use across() properly, but without success:
df_3 <- df_start %>%
  group_by(shipment) %>%
  summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))
> df_3 <- df_start %>%
+ group_by(shipment) %>%
+ summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))
Error: Problem with `summarise()` input `..2`.
x Input `..2` must be size 4 or 1, not 8.
i An earlier column had size 4.
i Input `..2` is `na.locf(., na.rm = FALSE, fromLast = FALSE)`.
i The error occurred in group 1: shipment = "A".
I've read through vignette("colwise"), which describes the differences and suggests I could simply replace the syntax as shown above, but clearly I'm not getting it right. Help?
You have a couple of syntax issues in the code.
1 - The arguments .cols and .fns belong inside across(). In your code the across() call is closed right after everything() (across(everything())), so na.locf(...) is passed to summarise() as a separate second argument (the input ..2 in the error message).
2 - To use . in across() you need to prefix the expression with ~ to indicate that you are passing a lambda expression as the function (see the .fns argument in ?across).
Incorporating these changes, you can use -
library(dplyr)
library(zoo)
df_start %>%
  group_by(shipment) %>%
  summarise(across(everything(), ~na.locf(., na.rm = FALSE, fromLast = FALSE)))
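To illustrate point 2: the ~ form is shorthand for an anonymous function, and across() also accepts an ordinary function(...) definition in its place. Below is a minimal sketch on made-up data. The toy tibble and its column names are assumptions for illustration only, and mean() stands in for na.locf() so that the example needs nothing beyond dplyr.

```r
library(dplyr)

# Toy data, invented for this example (not the shipment data above)
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, NA, 3, 5),
  y   = c(NA, 2, 4, 6)
)

# Formula (lambda) shorthand: .x refers to the current column
by_formula <- toy %>%
  group_by(grp) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# The same computation spelled as an explicit anonymous function
by_function <- toy %>%
  group_by(grp) %>%
  summarise(across(everything(), function(col) mean(col, na.rm = TRUE)))

# Both spellings should give the same summary
all.equal(by_formula, by_function)
```

Either spelling keeps the extra arguments (here na.rm = TRUE) next to the function they belong to, which tends to make the call easier to read than passing them separately.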
However, across() has everything() as its default .cols argument, and you can also pass the function by name without ~, so another way to write this would be -
df_start %>%
  group_by(shipment) %>%
  summarise(across(.fns = na.locf, na.rm = FALSE, fromLast = FALSE))
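One thing to keep in mind with the calls above: na.locf() with na.rm = FALSE returns a vector as long as its input, so summarise() still produces one row per input row, which is why the working code in the question ends with filter(row_number() == n()). As a sketch of an alternative that collapses each group to a single row directly, the helper below keeps the last non-missing value in each column. Note that last_non_na() is a hypothetical helper written for this example, not a dplyr or zoo function, and the events tibble is a cut-down stand-in for df_start.

```r
library(dplyr)

# Hypothetical helper (not from dplyr or zoo): return the last non-missing
# value of x, or a single NA of the same type if every value is missing.
last_non_na <- function(x) {
  kept <- x[!is.na(x)]
  if (length(kept) > 0) kept[length(kept)] else x[NA_integer_]
}

# Cut-down stand-in for df_start: one stop per shipment, one event per row
events <- tibble(
  shipment        = c("A", "A", "B", "B"),
  stop            = c(1, 1, 1, 1),
  arrive_pickup   = as.POSIXct(c("2021-01-01 07:00:00", NA,
                                 "2021-06-05 12:10:00", NA), tz = "UTC"),
  depart_pickup   = as.POSIXct(c(NA, "2021-01-01 08:40:00",
                                 NA, "2021-06-05 16:58:00"), tz = "UTC"),
  arrive_delivery = as.POSIXct(rep(NA_character_, 4), tz = "UTC")
)

# One row per shipment/stop; columns with no event in a group stay NA
collapsed <- events %>%
  group_by(shipment, stop) %>%
  summarise(across(everything(), last_non_na), .groups = "drop")

collapsed
```

Because the helper always returns exactly one value per column, no trailing filter() is needed, and grouping by shipment alone works the same way.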