Let's say I have a data frame with three columns: the first gives the name of a feature (e.g. a color), the second a group, and the third whether the feature is present in that group (1) or missing from it (0):
> d<-data.frame(feature=c("red","blue","green","yellow","red","blue","green","yellow"), group=c(rep("a",4),rep("b",4)),is_there=c(0,1,1,0,1,1,1,0))
> d
  feature group is_there
1     red     a        0
2    blue     a        1
3   green     a        1
4  yellow     a        0
5     red     b        1
6    blue     b        1
7   green     b        1
8  yellow     b        0
Now I would like a summary of how many features are (1) only in group a, (2) only in group b, and (3) present in both groups. Additionally, I need to extract the names of the features present in both groups. How can I do that? I imagine that a function like crossprod might help, but I cannot figure it out.
The output would be something like:
feature
red 1
blue 2
green 2
yellow 0
or:
feature a b
red 0 1
blue 1 1
green 1 1
yellow 0 0
Either way, I need a better overview of a fairly large data file (the original has hundreds of features in about 10 groups).
It sounds like a table is what you want. First we subset the rows where the is_there column equals 1 and drop the third column. Then we call table() on that subset.
> ( tab <- table(d[d$is_there == 1, -3]) )
#         group
# feature  a b
#   blue   1 1
#   green  1 1
#   red    0 1
#   yellow 0 0
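One caveat about the output above: the yellow row of zeros only shows up when feature is a factor, which was the data.frame() default before R 4.0.0. With character columns, table() only sees the values left after subsetting, so yellow is dropped. A minimal sketch of an alternative, assuming you want every feature kept regardless of column type, is to sum is_there per cell with xtabs(). The name tab2 is just an illustrative choice.

```r
# Rebuild the example data from the question.
d <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = c(rep("a", 4), rep("b", 4)),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
)

# Sum is_there within each feature/group cell. Every row of d takes part,
# so a feature that is absent from all groups still gets a row of zeros.
tab2 <- xtabs(is_there ~ feature + group, data = d)
tab2
```

Because no rows are filtered out first, this does not depend on whether the columns are factors or character vectors.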
A table is a matrix-like object. We can operate on it in much the same way we operate on a matrix.
Looking at group a:
> tab[,"a"] ## vector of group "a"
#   blue  green    red yellow
#      1      1      0      0
> tab[,"a"][ tab[,"a"] > 0 ] ## present in group "a"
#  blue green
#     1     1
> names(tab[,"a"][ tab[,"a"] > 0 ]) ## "feature" present in group "a"
# [1] "blue"  "green"
And the same for group b.
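To get the counts the question asks for (only in a, only in b, in both) and the names of the shared features, the per-group vectors can be combined with logical operators. The following is a sketch under the assumption that the table is built with xtabs() so that all features are kept; the names in_a, in_b, summary_counts, both_names and n_groups are illustrative, not from the original post.

```r
# Rebuild the example data from the question.
d <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = c(rep("a", 4), rep("b", 4)),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
)

# Feature-by-group table of presence counts (all features retained).
tab <- xtabs(is_there ~ feature + group, data = d)

# Named logical vectors: is each feature present in the group?
in_a <- tab[, "a"] > 0
in_b <- tab[, "b"] > 0

# How many features fall into each category.
summary_counts <- c(
  only_a = sum(in_a & !in_b),
  only_b = sum(in_b & !in_a),
  both   = sum(in_a & in_b)
)
summary_counts

# Names of the features present in both groups.
both_names <- names(which(in_a & in_b))
both_names

# With many groups: number of groups in which each feature occurs.
n_groups <- rowSums(tab > 0)
n_groups
```

For the roughly 10 groups mentioned in the question, rowSums(tab > 0) is probably the more practical overview, and comparing it against ncol(tab) picks out the features present in every group.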