How to identify and tally intersection items in R


I have a data frame which shows membership in three color classes. Numbers refer to unique IDs. One ID may be a part of one group or multiple groups.

dat <- data.frame(BLUE = c(1, 2, 3, 4, 6, NA),
                  RED = c(2, 3, 6, 7, 9, 13),
                  GREEN = c(4, 6, 8, 9, 10, 11))

or for visual reference:

BLUE  RED  GREEN
1     2    4
2     3    6
3     6    8
4     7    9
6     9    10
NA    13   11

I need to identify and tally individual and cross-group membership (i.e., how many IDs were only in red, how many were in both red and blue, etc.). My desired output is below. Note that the IDs column is for reference only; it would not be in the expected output.

COLOR                TOTAL  IDs (reference only, not needed in final output)
RED                  2      (7, 13)
BLUE                 1      (1)
GREEN                3      (8, 10, 11)
RED, BLUE            3      (2, 3, 6)
RED, GREEN           2      (6, 9)
BLUE, GREEN          2      (4, 6)
RED, BLUE, GREEN     1      (6)

Does anyone know an efficient way to do this in R? Thanks!


Solution

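    library(dplyr)
    library(tidyr)
    
    # Note: gather() and nest(..., .key = ) are the older tidyr interface,
    # superseded or deprecated as of tidyr 1.0.0.
    cbind(dat, row = 1:6) %>% 
      gather(COLOR, IDs, -row) %>%                   # long format: one row per (COLOR, ID)
      group_by(IDs) %>% 
      nest(COLOR, .key="COLOR") %>%                  # collect the colors each ID appears under
      mutate(COLOR = sapply(COLOR, as.character)) %>%  # one character label per set of colors
      drop_na %>%                                    # remove the NA padding from BLUE
      group_by(COLOR) %>% 
      add_count(name="TOTAL") %>%                    # number of IDs per color combination
      group_by(COLOR, TOTAL) %>% 
      nest(IDs, .key = "IDs") %>%                    # collapse the IDs per combination (reference only)
      as.data.frame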
    
    #>                       COLOR TOTAL       IDs
    #> 1                      BLUE     1         1
    #> 2          c("BLUE", "RED")     2      2, 3
    #> 3        c("BLUE", "GREEN")     1         4
    #> 4 c("BLUE", "RED", "GREEN")     1         6
    #> 5                       RED     2     7, 13
    #> 6         c("RED", "GREEN")     1         9
    #> 7                     GREEN     3 8, 10, 11
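
For comparison, the same kind of tally can be sketched in base R without nesting. This is only a minimal sketch, not the method above: it assumes each ID appears at most once per column, and the names long, combo, and tally are made up for illustration.

```r
dat <- data.frame(BLUE = c(1, 2, 3, 4, 6, NA),
                  RED = c(2, 3, 6, 7, 9, 13),
                  GREEN = c(4, 6, 8, 9, 10, 11))

# Long format: one row per (ID, COLOR) pair; stack() names the columns
# `values` (the ID) and `ind` (the source column). na.omit() drops the NA padding.
long <- na.omit(stack(dat))
long$ind <- as.character(long$ind)

# For each ID, paste together the colors it appears under
combo <- vapply(split(long$ind, long$values), paste, character(1),
                collapse = ", ")

# Count how many IDs share each exact combination of colors
tally <- as.data.frame(table(COLOR = combo), responseName = "TOTAL",
                       stringsAsFactors = FALSE)
tally
```

Each ID is counted under exactly one combination here, so ID 6 contributes only to the "BLUE, RED, GREEN" row.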
    


    A more conventional approach is to use the venn package, after discarding the NA values from each set:

    library(purrr)
    library(magrittr)
    library(venn)
    
    as.list(dat) %>%
      map(discard, is.na) %>%
      compact() %>% 
      venn() %>% 
      print
    
    #>                BLUE RED GREEN counts
    #>                   0   0     0      0
    #> GREEN             0   0     1      3
    #> RED               0   1     0      2
    #> RED:GREEN         0   1     1      1
    #> BLUE              1   0     0      1
    #> BLUE:GREEN        1   0     1      1
    #> BLUE:RED          1   1     0      2
    #> BLUE:RED:GREEN    1   1     1      1
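
Both tables above assign each ID to exactly one combination, so ID 6 is counted only under all three colors. The desired output in the question lists 2, 3, 6 for RED, BLUE, which suggests that overlaps may be meant inclusively. If so, a rough base R sketch of inclusive counts could look like the following; the names sets, combos, shared, and inclusive are illustrative assumptions, not part of any package.

```r
dat <- data.frame(BLUE = c(1, 2, 3, 4, 6, NA),
                  RED = c(2, 3, 6, 7, 9, 13),
                  GREEN = c(4, 6, 8, 9, 10, 11))

# One vector of IDs per color, with the NA padding removed
sets <- lapply(as.list(dat), function(x) x[!is.na(x)])

# Every non-empty combination of color names: singles, pairs, the triple
combos <- unlist(lapply(seq_along(sets), function(k)
  combn(names(sets), k, simplify = FALSE)), recursive = FALSE)

# IDs present in all colors of a combination, whether or not they
# also appear in the remaining colors
shared <- lapply(combos, function(nm) Reduce(intersect, sets[nm]))

inclusive <- data.frame(
  COLOR = vapply(combos, paste, character(1), collapse = ", "),
  TOTAL = lengths(shared)
)
inclusive
```

Under this definition the single-color rows count every ID in that column (for example, 5 for BLUE), which differs from the "only in red" rows of the desired output; the two definitions answer different questions.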
    

    According to this answer, there are many other R packages for Venn diagrams.

    For instance, the venn.diagram function in the VennDiagram package has an na argument that accepts "stop", "remove", or "none". Here we would use "remove"; however, it only produces the diagram, not the table. You can explore the possibilities in other packages.
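
As a rough sketch of that option, assuming the VennDiagram package is installed: passing filename = NULL asks venn.diagram to return the grid objects instead of writing an image file, and na = "remove" drops the NA padding. The names sets and grobs are illustrative only.

```r
dat <- data.frame(BLUE = c(1, 2, 3, 4, 6, NA),
                  RED = c(2, 3, 6, 7, 9, 13),
                  GREEN = c(4, 6, 8, 9, 10, 11))

# venn.diagram() expects a named list of sets
sets <- as.list(dat)

# Only attempt the plot when the package is available
if (requireNamespace("VennDiagram", quietly = TRUE)) {
  # na = "remove" discards NA values from each set before drawing
  grobs <- VennDiagram::venn.diagram(sets, filename = NULL, na = "remove")
  grid::grid.newpage()
  grid::grid.draw(grobs)
}
```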