Tags: r, set-intersection, cross-product

Compare intersections between groups specified in first column


Let's say I have a data frame with three columns: the first gives the name of a feature (e.g. a color), the second a group, and the third whether the feature is present in that group (1) or missing from it (0):

> d<-data.frame(feature=c("red","blue","green","yellow","red","blue","green","yellow"), group=c(rep("a",4),rep("b",4)),is_there=c(0,1,1,0,1,1,1,0))
> d
  feature group is_there
1     red     a        0
2    blue     a        1
3   green     a        1
4  yellow     a        0
5     red     b        1
6    blue     b        1
7   green     b        1
8  yellow     b        0

Now I would like a summary of how many features are (1) only in group a, (2) only in group b, and (3) present in both groups. Additionally, I need to extract the names of the features present in both groups. How can I do that? I imagine that a function like crossprod might help, but I cannot figure it out.

The output would be something like:

feature 
red     1
blue    2
green   2
yellow  0

or:

feature a b
red     0 1
blue    1 1
green   1 1
yellow  0 0

Either way, I need a better overview of a fairly large data file (the original has hundreds of features in about 10 groups).


Solution

  • It sounds like a table is what you want. First, subset the rows where the is_there column equals 1 and drop the third column. Then call table() on that subset.

    > ( tab <- table(d[d$is_there == 1, -3]) )
    #         group
    # feature  a b
    #   blue   1 1
    #   green  1 1
    #   red    0 1
    #   yellow 0 0
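
    A note on the yellow row: it shows up with zeros only because feature is a factor, which was the data.frame() default before R 4.0.0. From R 4.0.0 on, strings stay character and table() would drop yellow after subsetting. The sketch below assumes a recent R version and shows two ways to keep empty levels: converting the columns to factors first, or summing is_there per cell with xtabs(). The name tab2 is only illustrative.

    ```r
    # Same data as in the question (character columns under R >= 4.0.0)
    d <- data.frame(
      feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
      group    = c(rep("a", 4), rep("b", 4)),
      is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
    )

    # Option 1: make the columns factors so table() keeps levels
    # that have no rows left after subsetting (here: yellow)
    d$feature <- factor(d$feature)
    d$group   <- factor(d$group)
    tab <- table(d[d$is_there == 1, -3])
    tab

    # Option 2: sum is_there for each feature/group cell, no subsetting needed
    tab2 <- xtabs(is_there ~ feature + group, data = d)
    tab2
    ```

    Both objects hold the same 4 x 2 grid of counts, so the rest of the answer works with either.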
    

    A table is a matrix-like object. We can operate on it in much the same way we operate on a matrix.
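
    As a small illustration of treating the table like a matrix, row and column sums already give the first summary asked for in the question. This is a sketch that rebuilds the table with xtabs() so it runs on its own; per_feature and per_group are illustrative names.

    ```r
    d <- data.frame(
      feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
      group    = c(rep("a", 4), rep("b", 4)),
      is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
    )
    tab <- xtabs(is_there ~ feature + group, data = d)

    # Number of groups in which each feature is present
    per_feature <- rowSums(tab)
    per_feature
    #   blue  green    red yellow
    #      2      2      1      0

    # Number of features present in each group
    per_group <- colSums(tab)
    per_group
    # a b
    # 2 3
    ```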

    Looking at group a:

    > tab[,"a"]                           ## vector of group "a"
    #  blue  green    red yellow 
    #     1      1      0      0 
    > tab[,"a"][ tab[,"a"] > 0 ]          ## present in group "a"
    #  blue green 
    #     1     1 
    > names(tab[,"a"][ tab[,"a"] > 0 ])   ## "feature" present in group "a"
    # [1] "blue"  "green"
    

    And the same for group b.
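
    To answer the counting part of the question directly, one possible approach is to turn the table into a 0/1 presence matrix and combine the columns logically. The sketch below also uses crossprod(), as suggested in the question, to get pairwise overlaps, which may scale better to about 10 groups. The names m, counts, both_names and shared are illustrative, not part of the original answer.

    ```r
    d <- data.frame(
      feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
      group    = c(rep("a", 4), rep("b", 4)),
      is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
    )
    tab <- xtabs(is_there ~ feature + group, data = d)

    # Plain 0/1 presence matrix: features in rows, groups in columns
    m <- matrix(as.numeric(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))

    in_a <- m[, "a"] == 1
    in_b <- m[, "b"] == 1

    # How many features are only in a, only in b, or in both
    counts <- c(only_a = sum(in_a & !in_b),
                only_b = sum(in_b & !in_a),
                both   = sum(in_a & in_b))
    counts
    # only_a only_b   both
    #      0      1      2

    # Names of the features present in both groups
    both_names <- rownames(m)[in_a & in_b]
    both_names
    # [1] "blue"  "green"

    # For many groups: entry [i, j] is the number of features shared by
    # groups i and j; the diagonal is the number of features per group
    shared <- crossprod(m)
    shared
    ```

    With more groups, shared becomes a groups-by-groups matrix, so every pairwise intersection size can be read off at once; the feature names for any pair can be extracted the same way as both_names.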