Search code examples
rerror-handlingcase-when

Stopping execution if case_when finds a certain result


From here:

x <- 1:50
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)

How can I use stop() to stop execution as soon as I get an x divisible by 6? I do not want multiple error messages, just one.


Solution

  • If you look at the internals of case_when, you'll realize that it is executing each vector in order of the vectors, not in order of the data. That is, in your example, your first expression x %% 35 == 0 evaluates for the entire length of x before the function even looks at your second expression, x %% 5 == 0. It evaluates all of the expressions (LHS) and only then starts checking for matches. So there is no short-circuiting the evaluation of your data.

    If you're not worried about over-calculation and just want to truncate your data, then

    x <- 1:50
    ret <- case_when(
      x %% 35 == 0 ~ "fizz buzz",
      x %% 5 == 0 ~ "fizz",
      x %% 7 == 0 ~ "buzz",
      TRUE ~ as.character(x)
    )
    ret[!cumany(x %% 6 == 0)]
    # [1] "1"    "2"    "3"    "4"    "fizz"
    

    If you want/need to stay within the case_when call, then perhaps

    x <- 1:50
    ret <- case_when(
      cumany(x %% 6 == 0) ~ NA_character_,
      x %% 35 == 0 ~ "fizz buzz",
      x %% 5 == 0 ~ "fizz",
      x %% 7 == 0 ~ "buzz",
      TRUE ~ as.character(x)
    )
    ret
    #  [1] "1"    "2"    "3"    "4"    "fizz" NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
    # [18] NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
    # [35] NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
    
    ## filtered
    na.omit(ret)
    # [1] "1"    "2"    "3"    "4"    "fizz"
    # attr(,"na.action")
    #  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
    # [40] 45 46 47 48 49 50
    # attr(,"class")
    # [1] "omit"
    
    ## or more succinctly
    ret[!is.na(ret)]
    # [1] "1"    "2"    "3"    "4"    "fizz"
    

    for which you can then na.omit or similar. But again, this is not interrupting or short-circuiting, so absolutely no savings in execution time. (And no ability to feed an infinitely-large vector with the assumption that it'll interrupt itself when one of the %% 6 conditions is found.)

    In order to short-circuit evaluation of the LHS, you will need to break the vectorized evaluation of case_when and do everything yourself. Unfortunately, there is a "problem": the way that case_when maintains generality is by evaluating each LHS expression (and RHS, for that matter) in the context of the calling environment. For instance, it evaluates x %% 35 == 0 in the parent frame (i.e., the environment that called case_when). The case_when function has no idea a priori how long the returned vector will be, nor how to correctly index (step through) any and all of the variables present. For instance, it seems intuitive that with x %% 35 == 0 we would want to first check x[1] %% 35 == 0, then x[2] %% 35 == 0, etc ... but what about (x+y) %% 35 == 0 where y might be a vector of the same length, a recycled vector of a divisible length, or a single value, where (x[50]+y[50]) %% 35 == 0 would result in an NA (or index error).

    Here's a way to wrap around case_when and only evaluate until the constraint is found, then stop.

    x <- 1:50
    ret <- numeric(length(x))
    for (i in seq_along(x)) {
      # constraint
      if (x[i] %% 6 == 0) break
      # regular piece-wise execution
      ret[i] <- case_when(
        x[i] %% 35 == 0 ~ "fizz buzz",
        x[i] %% 5 == 0 ~ "fizz",
        x[i] %% 7 == 0 ~ "buzz",
        TRUE ~ as.character(x[i])
      )
    }
    ret <- head(ret, i - 1)
    ret
    # [1] "1"    "2"    "3"    "4"    "fizz"
    

    This loop iterates 6 times, where the first 5 are executed normally, and on the 6th iteration it finds that x[i] %% 6 == 0 is true, and breaks.


    I would be remiss if I didn't address the relative (in)efficiency of these methods.

    x <- 1:50
    microbenchmark::microbenchmark(
      one = {
        ret <- case_when(
          x %% 35 == 0 ~ "fizz buzz",
          x %% 5 == 0 ~ "fizz",
          x %% 7 == 0 ~ "buzz",
          TRUE ~ as.character(x)
        )
        ret[!cumany(x %% 6 == 0)]
      },
      two = {
        ret <- case_when(
          cumany(x %% 6 == 0) ~ NA_character_,
          x %% 35 == 0 ~ "fizz buzz",
          x %% 5 == 0 ~ "fizz",
          x %% 7 == 0 ~ "buzz",
          TRUE ~ as.character(x)
        )
        ret[!is.na(ret)]
      },
      three = {
        ret <- numeric(length(x))
        for (i in seq_along(x)) {
          # constraint
          if (x[i] %% 6 == 0) break
          # regular piece-wise execution
          ret[i] <- case_when(
            x[i] %% 35 == 0 ~ "fizz buzz",
            x[i] %% 5 == 0 ~ "fizz",
            x[i] %% 7 == 0 ~ "buzz",
            TRUE ~ as.character(x[i])
          )
        }
        ret <- head(ret, i - 1)
      }
    )
    # Unit: microseconds
    #   expr    min      lq     mean  median      uq     max neval
    #    one  136.6  143.55  168.975  152.60  167.55   478.3   100
    #    two  156.9  171.10  199.213  180.05  206.80   427.3   100
    #  three 4772.7 5336.75 5854.889 5605.25 6073.20 12001.8   100
    

    It should be clear here that with this set of expressions (a few modulus operations), in R it is far more efficient to calculate more than we need but use a vectorized approach than it is to try to limit how much of x we process.

    If you're curious, this third method is still not efficient when x is 500K long ...

    x <- 1:500
    # Unit: microseconds
    #   expr    min     lq     mean  median      uq     max neval
    #    one  216.9  245.5  287.715  261.55  289.20   601.4   100
    #    two  220.9  260.8  300.539  277.75  295.75   691.5   100
    #  three 5578.7 6164.9 6802.093 6531.20 6884.25 13667.9   100
    x <- 1:5000
    # Unit: microseconds
    #   expr    min      lq     mean  median      uq      max neval
    #    one 1468.2 1644.50 3809.862 1708.65 1879.90 196632.1   100
    #    two  780.9  852.25  986.799  889.90  952.45   6761.6   100
    #  three 8061.9 8785.15 9836.741 9249.85 9803.70  17088.5   100
    x <- 1:50000
    # Unit: milliseconds
    #   expr     min       lq     mean   median       uq      max neval
    #    one 15.9505 20.33195 26.18902 22.60755 26.75880 230.6372   100
    #    two  6.8114  8.33300 12.92443  8.95825 14.18375 236.4153   100
    #  three 34.6127 43.44130 48.28222 47.23290 53.26485  71.2169   100
    x <- 1:500000
    # Unit: milliseconds
    #   expr      min       lq     mean   median       uq      max neval
    #    one 201.1099 220.5286 278.7940 238.9214 280.8388 548.7299   100
    #    two  82.8113 104.9474 139.0557 118.3804 136.0794 380.3658   100
    #  three 295.7582 310.8903 335.8939 322.4250 349.4466 567.1915   100
    

    But finally reaches some parity at 5M:

    x <- 1:5000000
    # Unit: seconds
    #   expr      min       lq     mean   median       uq      max neval
    #    one 2.713632 2.794410 3.371636 3.175023 3.820303 4.682576    10
    #    two 1.105257 1.278336 1.535301 1.371372 1.854551 2.281774    10
    #  three 3.082974 3.116061 3.292641 3.314118 3.476838 3.513049    10
    

    (This will vary significantly as the computation-cost changes.)