I am trying to add a list column to a tibble
data frame. The resulting list column is calculated from two columns contained in the data frame and a vector which is external / independent.
Suppose that the data frame and the vector are the following:
library(dplyr)
library(magrittr)
dat <- tibble(A = c(12, 27, 22, 1, 15, 30, 20, 28, 19),
B = c(68, 46, 69, 7, 44, 76, 72, 50, 51))
vec <- c(12, 25, 28, 58, 98)
Now, I would like to add (mutate
) the column y
so that for each row y
is a list containing the elements of vec
between A
and B
(inclusive).
The not-so-proper way to do this would be via loop. I initialize the column y
as list and update it row-wise based on the condition A <= vec & vec <= B
:
dat %<>%
mutate(y = list(vec))
for (i in 1:nrow(dat)){
dat[i,]$y[[1]] <- (vec[dat[i,]$A <= vec & vec <= dat[i,]$B])
}
The result is a data frame with y
being a list of dbl
of variable length:
> dat
# A tibble: 9 x 3
A B y
<dbl> <dbl> <list>
1 12 68 <dbl [4]>
2 27 46 <dbl [1]>
3 22 69 <dbl [3]>
4 1 7 <dbl [0]>
5 15 44 <dbl [2]>
6 30 76 <dbl [1]>
7 20 72 <dbl [3]>
8 28 50 <dbl [1]>
9 19 51 <dbl [2]>
The first four values of y
are:
[[1]]
[1] 12 25 28 58
[[2]]
[1] 28
[[3]]
[1] 25 28 58
[[4]]
numeric(0)
Note: the 4-th list is empty, because no value of vec
is between A=1
and B=7
.
I have tried as an intermediate step with getting the subscripts via which
using mutate(y = list(which(A <= vec & vec <= B)))
or with a combination of seq
and %in%
, for instance mutate(y = list(vec %in% seq(A, B)))
. These both give an error. However, I don't need the subscripts, I need a subset of vec
.
Create a small helper function with the logic that you want to implement.
return_values_in_between <- function(vec, A, B) {
vec[A <= vec & vec <= B]
}
and call the function for each row (using rowwise
) -
library(dplyr)
result <- dat %>%
rowwise() %>%
mutate(y = list(return_values_in_between(vec, A, B))) %>%
ungroup()
result
# A tibble: 9 × 3
# A B y
# <dbl> <dbl> <list>
#1 12 68 <dbl [4]>
#2 27 46 <dbl [1]>
#3 22 69 <dbl [3]>
#4 1 7 <dbl [0]>
#5 15 44 <dbl [2]>
#6 30 76 <dbl [1]>
#7 20 72 <dbl [3]>
#8 28 50 <dbl [1]>
#9 19 51 <dbl [2]>
Checking the first 4 values in result$y
-
result$y
#[[1]]
#[1] 12 25 28 58
#[[2]]
#[1] 28
#[[3]]
#[1] 25 28 58
#[[4]]
#numeric(0)
#...
#...