I want to filter a column containing vectors on the entire contents of the cell. I've looked at R dplyr. Filter a dataframe that contains a column of numeric vectors, but my need is slightly different.
Sample df (full reprex below)
df <- tibble::tribble(
~id, ~len, ~vec,
1L, 1L, 1L,
2L, 2L, 1:2,
3L, 2L, c(1L, 2L),
4L, 3L, c(1L, 2L, 3L),
5L, 3L, 1:3,
6L, 3L, c(1L, 3L, 2L),
7L, 3L, c(3L, 2L, 1L),
8L, 3L, c(1L, 3L, 2L),
9L, 4L, c(1L, 2L, 4L, 3L),
10L, 3L, c(3L, 2L, 1L)
)
gives (colour-coded for matches)
I can group_by the vec column:
dfg <- df %>%
group_by(vec) %>%
summarise(n = n()
,total_len = sum(len))
For individual cells, a straight comparison doesn't work, but identical does:
df$vec[4] == df$vec[5]
#> Error in df$vec[4] == df$vec[5]: comparison of these types is not implemented
identical(df$vec[4], df$vec[5])
#> [1] TRUE
But none of the equivalents work in a filter, which is what I need:
df %>% filter(vec == c(1L, 2L, 3L))
#> Warning in vec == c(1L, 2L, 3L): longer object length is not a multiple of
#> shorter object length
#> Error: Problem with `filter()` input `..1`.
#> x 'list' object cannot be coerced to type 'integer'
#> i Input `..1` is `vec == c(1L, 2L, 3L)`.
df %>% filter(identical(vec, c(1L, 2L, 3L)))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>
df %>% filter(identical(vec, vec[5]))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>
I'm sure there's a simple solution I'm missing.
A more advanced need is to match where the contents of the cell match in any order, so the 6 orange, purple and gold highlighted cells above would all match. A solution that also works with lists as well as vectors would also be great as this may be a future need.
Full reprex:
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble::tribble(
~id, ~len, ~vec,
1L, 1L, 1L,
2L, 2L, 1:2,
3L, 2L, c(1L, 2L),
4L, 3L, c(1L, 2L, 3L),
5L, 3L, 1:3,
6L, 3L, c(1L, 3L, 2L),
7L, 3L, c(3L, 2L, 1L),
8L, 3L, c(1L, 3L, 2L),
9L, 4L, c(1L, 2L, 4L, 3L),
10L, 3L, c(3L, 2L, 1L)
)
df
#> # A tibble: 10 x 3
#> id len vec
#> <int> <int> <list>
#> 1 1 1 <int [1]>
#> 2 2 2 <int [2]>
#> 3 3 2 <int [2]>
#> 4 4 3 <int [3]>
#> 5 5 3 <int [3]>
#> 6 6 3 <int [3]>
#> 7 7 3 <int [3]>
#> 8 8 3 <int [3]>
#> 9 9 4 <int [4]>
#> 10 10 3 <int [3]>
dfg <- df %>%
group_by(vec) %>%
summarise(n = n()
,total_len = sum(len))
#> `summarise()` ungrouping output (override with `.groups` argument)
dfg
#> # A tibble: 6 x 3
#> vec n total_len
#> <list> <int> <int>
#> 1 <int [1]> 1 1
#> 2 <int [2]> 2 4
#> 3 <int [3]> 2 6
#> 4 <int [3]> 2 6
#> 5 <int [3]> 2 6
#> 6 <int [4]> 1 4
df$vec[4] == df$vec[5]
#> Error in df$vec[4] == df$vec[5]: comparison of these types is not implemented
identical(df$vec[4], df$vec[5])
#> [1] TRUE
df %>% filter(vec == c(1L, 2L, 3L))
#> Warning in vec == c(1L, 2L, 3L): longer object length is not a multiple of
#> shorter object length
#> Error: Problem with `filter()` input `..1`.
#> x 'list' object cannot be coerced to type 'integer'
#> i Input `..1` is `vec == c(1L, 2L, 3L)`.
df %>% filter(identical(vec, c(1L, 2L, 3L)))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>
df %>% filter(identical(vec, vec[5]))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>
Created on 2021-01-13 by the reprex package (v0.3.0)
Throw in rowwise
and also check the length
of vector to compare to avoid the warnings.
library(dplyr)
compare <- c(1L, 2L, 3L)
df %>%
rowwise() %>%
filter(length(vec) == length(compare) && all(vec == compare))
# id len vec
# <int> <int> <list>
#1 4 3 <int [3]>
#2 5 3 <int [3]>
We can filter
by length first which might be faster on larger datasets.
df %>%
filter(lengths(vec) == length(compare)) %>%
rowwise() %>%
filter(all(vec == compare))
Similar logic in base R :
subset(df, sapply(vec, function(x)
length(x) == length(compare) && all(x == compare)))