Search code examples
rlisttidyversematch

How to count the number of matches in a tibble list column to another data frame in R?


I have two sets of data:

The first data frame (small) is relatively smaller than the second data frame (large). Each data frame has an id column with unique identifiers. The smaller data frame has a list column called links, which contains a list of links to the larger second data frame. The larger data frame has a column of attributes, we'll call att:

library(tidyverse)

a <- c(3, 3, NA, 5)
b <- c(NA, 3, 4, 5)

small <- tibble(id = c(1, 2),
                links = list(a, b))

large <- tibble(id = c(3, 4, 5),
                att = c("yes", "no", "maybe"))

My goal is to count the number of times each observation in the small data frame has links to observations with the "yes" attribute in the large data frame.

I feel like something like this is on the right track, but it isn't quite counting correctly:

counted <- small %>%
  mutate(count_yes = map_int(links, ~ sum(large$att[large$id %in% .x] == "yes")))

print(counted)
#> # A tibble: 2 × 3
#>      id links     count_yes
#>   <dbl> <list>        <int>
#> 1     1 <dbl [4]>         1
#> 2     2 <dbl [4]>         1

Here, count_yes appears as only 1, when it should read as a 2 and a 1.


Solution

  • You are on the right track but need some adjustment.

    small %>%
      mutate(count_yes = map_int(links, ~sum(.x %in% large$id[large$att %in% "yes"])))
    
    #     id links     count_yes
    #  <dbl> <list>        <int>
    #1     1 <dbl [4]>         2
    #2     2 <dbl [4]>         1
    

    Or in base R :

    sapply(small$links, \(x) sum(x %in% large$id[large$att %in% "yes"]))
    

    Note the use of %in% instead of == would return FALSE for NA values.