Search code examples
rdplyrpipetidyversemagrittr

Print message in the middle of long pipe


In my R script, I would like to print a message in the middle of a long pipe (which takes a long time to run) without breaking the pipe. I need this to track the progress of the running pipeline.

I tried to create a simple function that prints a message without breaking the pipe:

CatPipe <- function(x, message = NULL) {
  cat(message, sep = "\n")
  return(x)
}

Here is an example implementation:

require(dplyr)
mtcars %>% 
  CatPipe("Hello1") %>% 
  {
    Sys.sleep(1)
    .
  } %>% 
  CatPipe("Hello2") %>% 
  {
    Sys.sleep(1)
    .
     } %>% 
    head(5) %>% 
    CatPipe("Hello3")

However, the printed messages were not as expected. The messages are printed in reverse order, irrespective if the current step was finished or not!

Hello3
Hello2
Hello1
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Is there an explanation for this? Should I make any changes to the function? Is there an alternative for this using dplyr, magrittr, or similar packages?


Solution

  • The %>% pipe uses lazy evaluation, which means that it doesn't evaluate expressions until it needs to. In practice, this means that it reaches the end of the pipe and then works backward ... the magrittr package defines an eager pipe operator %!>% which has the behaviour you want (%>% actually comes from magrittr as well; it's imported and re-exported by dplyr).

    In order to print the mtcars output before Hello3, you can use a tee pipe (%T>%), also from magrittr, plus a print() statement. The reason the output usually prints at the end is that R only prints the final result of an expression once it has been completely evaluated. (I also added a -> result to the end of the pipe to store the result and prevent it from being printed again at the end; I added a SleepPipe() function similar to your CatPipe() to make the code a little bit prettier.)

    library(magrittr)
    SleepPipe <- function(x, time = 1) {
        Sys.sleep(time)
        return(x)
    }
    mtcars %!>% 
      CatPipe("Hello1") %!>% 
      SleepPipe(1) %!>% 
      CatPipe("Hello2") %!>% 
      SleepPipe(1) %!>%
      head(5) %T>% print() %!>% 
      CatPipe("Hello3") -> result
    
    • I can't guarantee that changing from lazy to eager evaluation won't have other side effects in a complicated pipeline
    • the native-R pipe |> also uses lazy evaluation (to test it I had to use the SleepPipe() function, since the native pipe doesn't allow arbitrary {.} expressions)
    • FWIW R for Data Science contradicts this answer: it says that functions that use lazy evaluation won't work in piped workflows because "The pipe computes each element in turn, so you can’t rely on this behaviour". I think this is because magrittr re-implemented the pipe to use lazy evaluation in version 2.0 ....