Weird behavior of mapply with rep and dplyr pipes in R


I am dealing with strings that use two separators, "*" and "|", for example:

"3*4|2*7.4|8*3.2"

The number right before "*" denotes the frequency, and the float or integer right after "*" denotes the value. These frequency-value pairs are separated by "|".

So from "3*4|2*7.4|8*3.2", I would like to get the following vector:

"4","4","4","7.4","7.4","3.2","3.2","3.2","3.2","3.2","3.2","3.2","3.2"

I have come up with the following code, which completes with no errors or warnings, but the result is not what I expected:

strsplit("3*4|2*7.4|8*3.2", "[*|]") %>% # Split into a vector with two different separator characters
  unlist %>% # strsplit returns a list, so let's unlist it
  mapply(FUN = rep,
         x = .[seq(from = 2, to = length(.), by = 2)], # even indices: the values
         times = .[seq(from = 1, to = length(.), by = 2)], # odd indices: the frequencies; rep() also accepts times as a string
         USE.NAMES = FALSE) %>%
  unlist # mapply returns a list, so let's unlist it

[1] "4"   "4"   "4"   "7.4" "7.4" "7.4" "7.4" "3.2" "3.2" "4"   "4"   "4"   "4"   "4"   "4"   "4"   "7.4" "7.4" "7.4" "7.4" "7.4" "7.4" "7.4" "7.4" "3.2" "3.2" "3.2"

As you can see, something weird has happened. "4" has been repeated three times, which is correct, but "7.4" has been repeated four times (incorrectly) and so on.

What is going on here?


Solution

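  • 1a) The problem with the code in the question is that %>% still passes the dot as the first argument of mapply, because the dot only appears inside nested expressions such as .[seq(...)]. The whole vector therefore reaches rep as an extra unnamed argument alongside x and times. To avoid this, wrap the mapply call in braces as shown here, where ... represents the same arguments as in the question.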

    { mapply(...) } %>%
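
To see why the braces matter, the call that the pipe builds can be written out by hand. This is only a sketch in base R, assuming magrittr's rule that the left-hand side is inserted as the first argument whenever the dot appears only inside nested expressions. The names v, vals, freq, buggy and fixed are made up for illustration.

```r
# The split and unlisted vector that reaches mapply in the question
v <- unlist(strsplit("3*4|2*7.4|8*3.2", "[*|]"))
vals <- v[seq(from = 2, to = length(v), by = 2)]  # values: even positions
freq <- v[seq(from = 1, to = length(v), by = 2)]  # frequencies: odd positions

# What the question's pipe effectively calls (assumption stated above):
# v is inserted as an extra unnamed argument, so each call to rep()
# receives three arguments instead of two
buggy <- unlist(mapply(FUN = rep, v, x = vals, times = freq, USE.NAMES = FALSE))

# What { mapply(...) } calls: only x and times are passed on to rep()
fixed <- unlist(mapply(FUN = rep, x = vals, times = freq, USE.NAMES = FALSE))

print(fixed)
identical(buggy, fixed)  # the two results differ
```

Under that assumption, buggy should reproduce the unexpected output shown in the question, while fixed gives the intended 13 elements.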
    

    1b) Actually mapply is not needed in the first place since rep is vectorized:

    x %>%
      strsplit("[*|]") %>%
      unlist %>%
      { rep(x = .[seq(from = 2, to = length(.), by = 2)],
            times = .[seq(from = 1, to = length(.), by = 2)])
      }
    ## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
    

    1c) and a further simplification is to use logical values for the index realizing that they recycle:

    x %>%
      strsplit("[*|]") %>%
      unlist %>%
      { rep(x = .[c(FALSE, TRUE)], times = .[c(TRUE, FALSE)]) }
    ## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
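
The recycling that 1c relies on can be checked in isolation. The following is a small sketch in base R with an illustrative vector v (the already split input); it shows that a length-2 logical index is repeated along the vector and so picks every other element.

```r
v <- c("3", "4", "2", "7.4", "8", "3.2")

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE TRUE FALSE
freq <- v[c(TRUE, FALSE)]   # odd positions: "3" "2" "8"
vals <- v[c(FALSE, TRUE)]   # even positions: "4" "7.4" "3.2"

# Same selection as the seq()-based indices used in 1b
identical(freq, v[seq(from = 1, to = length(v), by = 2)])  # TRUE
identical(vals, v[seq(from = 2, to = length(v), by = 2)])  # TRUE
```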
    

    1d) A base R version using R's native pipe, |> (available since R 4.1.0), is:

    x |>
      strsplit("[*|]") |>
      setNames("x") |>
      with(rep(x = x[c(FALSE, TRUE)], times = x[c(TRUE, FALSE)]))
    ## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
    

    Also note the following one-liners:

    2a) The following one-liner matches the two numbers in each pair and passes them as separate arguments to an anonymous function written in formula notation; strapply returns the function's output. The input x is from the question and is defined explicitly in the Note at the end.

    library(gsubfn)
    
    strapply(x, "([0-9]+)\\*([0-9.]+)", n + x ~ rep(x, n))[[1]]
    ## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
    

    2b) If we have a character vector of strings like x, the same call also works once the [[1]] is removed. In that case it returns a list with one result per string.

    xx <- c(x, x)
    strapply(xx, "([0-9]+)\\*([0-9.]+)", n + x ~ rep(x, n))
    ## [[1]]
    ## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
    ##
    ## [[2]]
    ## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
    

    3) Another way to do it is to extract the repetition numbers and the values separately and pass each such vector to rep.

    library(gsubfn)
    
    rep(strapplyc(x, "\\*([0-9.]+)")[[1]], strapplyc(x, "(\\d+)\\*")[[1]])
    ## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
    

    Note

    The input used is:

    x <- "3*4|2*7.4|8*3.2"