I have two sets of data:
The first data frame (small
) is relatively smaller than the second data frame (large
). Each data frame has an id
column with unique identifiers. The smaller data frame has a list column called links
, which contains a list of links to the larger second data frame. The larger data frame has a column of attributes, we'll call att
:
library(tidyverse)
a <- c(3, 3, NA, 5)
b <- c(NA, 3, 4, 5)
small <- tibble(id = c(1, 2),
links = list(a, b))
large <- tibble(id = c(3, 4, 5),
att = c("yes", "no", "maybe"))
My goal is to count the number of times each observation in the small
data frame has links to observations with the "yes"
attribute in the large
data frame.
I feel like something like this is on the right track, but it isn't quite counting correctly:
counted <- small %>%
mutate(count_yes = map_int(links, ~ sum(large$att[large$id %in% .x] == "yes")))
print(counted)
#> # A tibble: 2 × 3
#> id links count_yes
#> <dbl> <list> <int>
#> 1 1 <dbl [4]> 1
#> 2 2 <dbl [4]> 1
Here, count_yes
appears as only 1, when it should read as a 2 and a 1.
You are on the right track but need some adjustment.
small %>%
mutate(count_yes = map_int(links, ~sum(.x %in% large$id[large$att %in% "yes"])))
# id links count_yes
# <dbl> <list> <int>
#1 1 <dbl [4]> 2
#2 2 <dbl [4]> 1
Or in base R :
sapply(small$links, \(x) sum(x %in% large$id[large$att %in% "yes"]))
Note the use of %in%
instead of ==
would return FALSE
for NA
values.