Search code examples
rdplyrpurrrmutate

dplyr mutate based on columns condition and an external vector


I am trying to add a list column to a tibble data frame. The resulting list column is calculated from two columns contained in the data frame and a vector which is external / independent.

Suppose that the data frame and the vector are the following:

library(dplyr)
library(magrittr)
dat <- tibble(A = c(12, 27, 22, 1, 15, 30, 20, 28, 19),
              B = c(68, 46, 69, 7, 44, 76, 72, 50, 51))

vec <- c(12, 25, 28, 58, 98)

Now, I would like to add (mutate) the column y so that for each row y is a list containing the elements of vec between A and B (inclusive).

The not-so-proper way to do this would be via loop. I initialize the column y as list and update it row-wise based on the condition A <= vec & vec <= B:

dat %<>% 
  mutate(y = list(vec))

for (i in 1:nrow(dat)){
  dat[i,]$y[[1]] <- (vec[dat[i,]$A <= vec &  vec <= dat[i,]$B])
}

The result is a data frame with y being a list of dbl of variable length:

> dat
# A tibble: 9 x 3
      A     B y        
  <dbl> <dbl> <list>   
1    12    68 <dbl [4]>
2    27    46 <dbl [1]>
3    22    69 <dbl [3]>
4     1     7 <dbl [0]>
5    15    44 <dbl [2]>
6    30    76 <dbl [1]>
7    20    72 <dbl [3]>
8    28    50 <dbl [1]>
9    19    51 <dbl [2]>

The first four values of y are:

[[1]]
[1] 12 25 28 58

[[2]]
[1] 28

[[3]]
[1] 25 28 58

[[4]]
numeric(0)

Note: the 4-th list is empty, because no value of vec is between A=1 and B=7.

I have tried as an intermediate step with getting the subscripts via which using mutate(y = list(which(A <= vec & vec <= B))) or with a combination of seq and %in%, for instance mutate(y = list(vec %in% seq(A, B))). These both give an error. However, I don't need the subscripts, I need a subset of vec.


Solution

  • Create a small helper function with the logic that you want to implement.

    return_values_in_between <- function(vec, A, B) {
      vec[A <= vec & vec <= B]
    }
    

    and call the function for each row (using rowwise) -

    library(dplyr)
    
    result <- dat %>%
      rowwise() %>%
      mutate(y = list(return_values_in_between(vec, A, B))) %>%
      ungroup()
    
    result
    
    # A tibble: 9 × 3
    #      A     B    y        
    #   <dbl> <dbl> <list>   
    #1    12    68 <dbl [4]>
    #2    27    46 <dbl [1]>
    #3    22    69 <dbl [3]>
    #4     1     7 <dbl [0]>
    #5    15    44 <dbl [2]>
    #6    30    76 <dbl [1]>
    #7    20    72 <dbl [3]>
    #8    28    50 <dbl [1]>
    #9    19    51 <dbl [2]>
    

    Checking the first 4 values in result$y -

    result$y
    #[[1]]
    #[1] 12 25 28 58
    
    #[[2]]
    #[1] 28
    
    #[[3]]
    #[1] 25 28 58
    
    #[[4]]
    #numeric(0)
    
    #...
    #...