Search code examples
rdataframegroup-byintersect

count values in one dataframe in another dataframe


Assume I have a dataframe like this:

df <- data.frame(
  D = c('A', 'B', 'C'),
  Q = c('asd', 'reg', 'rt'),
  id = I(list(c(124, 532, 78), c(1, 3, 532), c(2, 3, 78, 124, 1)))
)

and another one like this:

clusters <- data.frame(
 g = I(list(c(124, 78, 1),c(2, 3, 89),c(532, 533)))
)

I want to count the times the elements in each id list are in the groups g1, g2, g3 (each row in clusters) when it's more than once. So basically count the intersection whenever the intersection > 1.

The output should be:

out <- data.frame(
  D = c('A', 'B', 'C'),
  Q = c('asd', 'reg', 'rt'),
  id = I(list(c(124, 532, 78), c(1, 3, 532), c(2,3,78, 124, 1))),
  count = c(2, 0, 5)
)

I cannot do it with for loop cause it will take forever as I have 70k rows.

A for loop would look like this:

for row in df:
  for group in clusters:
     if group intersect row$id > 1
        count=count+intersection
  count=0

Solution

  • Probably you can try base R

    transform(
        df,
        count = rowSums(outer(id, clusters$g, \(...) mapply(\(...) {
            l <- length(intersect(...))
            l * (l > 1)
        }, ...)))
    )
    

    or dplyr

    df %>%
        left_join(df %>%
            unnest(id) %>%
            left_join(clusters %>%
                mutate(grp = row_number()) %>%
                unnest(g), by = join_by(id == g)) %>%
            summarise(count = {
                d <- table(grp)
                sum(d[d > 1])
            }, .by = c(D, Q)))
    

    which gives

      D   Q           id count
    1 A asd 124, 532, 78     2
    2 B reg    1, 3, 532     0
    3 C  rt 2, 3, 78....     5