R Proper use for across() function with na.locf()


I am trying to replicate this SO question, but by using the updated syntax which uses the across() function and gets away from the deprecated summarise_all() and funs().

Starting Data

I have a database extract that has one row per event type, like so:

library(tidyverse)
library(zoo)

df_start <- tibble(shipment = c(rep("A",4), rep("B",4)), 
             stop = rep(c(1,1,2,2), 2),
             arrive_pickup = as.POSIXct(c("2021-01-01 07:00:00 UTC",NA, NA, NA,"2021-06-05 12:10:00 UTC", NA, NA, NA)),
             depart_pickup = as.POSIXct(c(NA,"2021-01-01 08:40:00 UTC", NA, NA, NA, "2021-06-05 16:58:00 UTC", NA, NA)),
             arrive_delivery = as.POSIXct(c(NA, NA, "2021-01-05 10:00:00 UTC",NA, NA, NA,"2021-06-08 10:58:00 UTC", NA)),
             depart_delivery = as.POSIXct(c(NA, NA, NA, "2021-01-05 11:30:00 UTC",NA, NA, NA,"2021-06-08 13:50:00 UTC"))
)

> df_start
# A tibble: 8 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 NA                  NA                  NA                 
2 A            1 NA                  2021-01-01 08:40:00 NA                  NA                 
3 A            2 NA                  NA                  2021-01-05 10:00:00 NA                 
4 A            2 NA                  NA                  NA                  2021-01-05 11:30:00
5 B            1 2021-06-05 12:10:00 NA                  NA                  NA                 
6 B            1 NA                  2021-06-05 16:58:00 NA                  NA                 
7 B            2 NA                  NA                  2021-06-08 10:58:00 NA                 
8 B            2 NA                  NA                  NA                  2021-06-08 13:50:00

Desired Outcome

I want to collapse the rows by grouping either by shipment and stop, or just by shipment. (I'm not sure whether leaving NA values in the final data frame affects the answer, but I'd like to be able to solve it either way.)

df_finish1 # One desired outcome

# A tibble: 4 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00

df_finish2 # Second/alternative desired outcome

# A tibble: 2 x 5
  shipment arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dttm>              <dttm>              <dttm>              <dttm>             
1 A        2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B        2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00

What I've researched and tried

Based on this SO question, which does work:

df_1 <- df_start %>% 
  group_by(shipment, stop) %>%   # Two groupings
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>% 
  filter(row_number()==n())
  
> df_1
# A tibble: 4 x 6
# Groups:   shipment, stop [4]
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00

df_2 <- df_start %>% 
  group_by(shipment) %>%   # Single grouping
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>% 
  filter(row_number()==n())

> df_2
# A tibble: 2 x 6
# Groups:   shipment [2]
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            2 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B            2 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00

However, summarise_all() and funs() are deprecated and should not be used going forward, so I am trying to work out how to use across() properly, so far without success:

df_3 <- df_start %>% 
  group_by(shipment) %>% 
  summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))

> df_3 <- df_start %>% 
+   group_by(shipment) %>% 
+   summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))
Error: Problem with `summarise()` input `..2`.
x Input `..2` must be size 4 or 1, not 8.
i An earlier column had size 4.
i Input `..2` is `na.locf(., na.rm = FALSE, fromLast = FALSE)`.
i The error occurred in group 1: shipment = "A".

I've read through vignette("colwise"), which describes the differences and suggests I could just replace the syntax as shown above, but clearly I'm not getting it right. Help?


Solution

  • You have a couple of syntax issues in the code.

    1. The .cols and .fns arguments both belong inside across(). In your code, across() is closed right after everything() (across(everything())), so na.locf(...) is passed to summarise() as a separate second argument (the input ..2 in the error message).

    2. When you use . inside across(), you need to prefix the expression with ~ to mark it as a lambda expression for the function being passed (see the .fns argument in ?across).

    Incorporating these changes, you can use -

    library(dplyr)
    library(zoo)
    
    df_start %>% 
      group_by(shipment) %>% 
      summarise(across(everything(), ~na.locf(., na.rm = FALSE, fromLast = FALSE)))
    

    However, across() uses everything() as its default .cols argument, and you can also pass the function by name without ~ (extra arguments are forwarded to it), so another way to write this would be -

    df_start %>% 
      group_by(shipment) %>% 
      summarise(across(.fns = na.locf, na.rm = FALSE, fromLast = FALSE))
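
Note that na.locf() returns one value per input row, so both calls above still produce several rows per group; the filter(row_number() == n()) step from the question (or an equivalent) is still needed to keep only the last, fully filled row. Below is a minimal sketch of the complete pipeline for both desired outcomes, assuming dplyr >= 1.0.0 and zoo are installed. It fills with mutate() rather than summarise(), because recent dplyr releases (1.1.0 onward, as far as I know) warn when summarise() returns more than one row per group and when extra arguments are passed to across() through `...`. The helper names utc() and collapse_last() and the explicit tz = "UTC" are choices made for this example, not part of the original code:

```r
library(dplyr)
library(zoo)

# Same data as df_start in the question. tz = "UTC" is set explicitly here
# (an assumption) so the result does not depend on the local time zone.
utc <- function(x) as.POSIXct(x, tz = "UTC")

df_start <- tibble(
  shipment        = c(rep("A", 4), rep("B", 4)),
  stop            = rep(c(1, 1, 2, 2), 2),
  arrive_pickup   = utc(c("2021-01-01 07:00:00", NA, NA, NA, "2021-06-05 12:10:00", NA, NA, NA)),
  depart_pickup   = utc(c(NA, "2021-01-01 08:40:00", NA, NA, NA, "2021-06-05 16:58:00", NA, NA)),
  arrive_delivery = utc(c(NA, NA, "2021-01-05 10:00:00", NA, NA, NA, "2021-06-08 10:58:00", NA)),
  depart_delivery = utc(c(NA, NA, NA, "2021-01-05 11:30:00", NA, NA, NA, "2021-06-08 13:50:00"))
)

# Hypothetical helper: carry the last observation forward within each group,
# then keep only the last (most complete) row of every group.
collapse_last <- function(data, ...) {
  data %>%
    group_by(...) %>%
    mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE))) %>%
    slice_tail(n = 1) %>%
    ungroup()
}

# One row per shipment and stop
df_finish1 <- collapse_last(df_start, shipment, stop)

# One row per shipment; stop is dropped because it is no longer meaningful
df_finish2 <- collapse_last(df_start, shipment) %>% select(-stop)
```

Inside a grouped mutate() or summarise(), across(everything(), ...) skips the grouping columns, so only the remaining columns are filled. Because na.locf() fills forward only, this relies on the rows being in chronological order within each group, as they are in df_start.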