I have a data frame with columns of categorical data supplied as strings. The categories for each column are the same, e.g.:
myDF=data.frame(col1=sample(c("a","b","c"),10,replace=T),
col2=sample(c("a","b","c"),10,replace=T),
col3=sample(c("a","b","c"),10,replace=T))
I would like to generate a table of counts in each category by column.
When all columns have all categories included, this can be done with apply
using the function table
, for instance:
> myDF
col1 col2 col3
1 a c a
2 b b b
3 a a b
4 b b a
5 c c a
6 a a a
7 a c c
8 a a c
9 c a a
10 a a b
> apply(myDF,2,table)
col1 col2 col3
a 6 5 5
b 2 2 3
c 2 3 2
However, if one column is missing some categories, this doesn't work because table
doesn't know what categories to expect:
myDF=data.frame(col1=sample(c("a","b","c"),10,replace=T),
col2=sample(c("a","b","c"),10,replace=T),
col3=sample(c("a","b"),10,replace=T))
Gives:
> myDF
col1 col2 col3
1 c a a
2 a a b
3 b a a
4 c c a
5 c a a
6 c c a
7 c b a
8 c b b
9 a a a
10 b b a
> apply(myDF,2,table)
$col1
a b c
2 2 6
$col2
a b c
5 3 2
$col3
a b
8 2
How can I produce a table that looks like the first one, with 0 for any missing categories?
You could collect all the factor levels and use those in apply
:
#get the levels from the whole data.frame
all_levels <- levels(unlist(myDF))
#convert each column to factor using the levels from above
#and then use table (which will return a zero for any missing levels)
apply(myDF, 2, function(x) table(factor(x, levels = all_levels)))
Output:
col1 col2 col3
a 1 4 7
b 5 2 3
c 4 4 0
> myDF
col1 col2 col3
1 b a a
2 c b a
3 c c b
4 b a b
5 b c a
6 c c a
7 c b a
8 b a b
9 a c a
10 b a a