r frequency categorical-data frequency-distribution

R get frequency distribution by a categorical or factor column

I have data like the sample below. To find a frequency distribution I can use the hist command as shown, and then read histz$breaks and histz$counts to get the number of observations that fall within each range.

I would like to get the distribution of column b separately for each value in column a. Column a is going to have 6 distinct values.

My expected output is a data frame with the following columns:

1st column - break value
2nd column - counts of b values that fall in the ranges defined by the break values, for rows where the first column of trial has value a
3rd column - the same counts, for rows where the first column of trial has value b
4th to 7th columns - same logic as the previous 2 columns, one column per remaining value

My data

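# sample data: a is the grouping column, b holds the values to bin
a=c("a","a","b","a","b","b","c","a")
b=c(1,3,4,3,5,7,8,9)
trial=data.frame(a,b)

# overall distribution of b, ignoring a
histz=hist(trial$b, breaks=c(0,4,6,100),plot=FALSE)
histz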

Solution

You can use cut() to bin b into ranges, then table() to cross-tabulate those ranges against a, which gives the distribution in each range for every value of a. For your example:

tab = table(cut(trial$b,breaks=c(0,4,6,100)),trial$a)

This produces

          a b c
  (0,4]   3 1 0
  (4,6]   0 1 0
  (6,100] 1 1 1
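
The question asks for a data frame whose first column holds the break ranges, while table() returns a contingency table. A minimal sketch of one way to convert it, assuming the same trial data as in the question; the names counts, result and the column name range are illustrative choices, not part of the original answer:

```r
# Rebuild the sample data from the question so this block runs on its own.
a <- c("a", "a", "b", "a", "b", "b", "c", "a")
b <- c(1, 3, 4, 3, 5, 7, 8, 9)
trial <- data.frame(a, b)

tab <- table(cut(trial$b, breaks = c(0, 4, 6, 100)), trial$a)

# as.data.frame.matrix() keeps the wide layout (one column per value of a).
# Plain as.data.frame() would instead return long format with a Freq column.
counts <- as.data.frame.matrix(tab)

# Move the interval labels from the row names into a first column
# ("range" is a hypothetical column name chosen for this sketch).
result <- data.frame(range = rownames(counts), counts, row.names = NULL)
result
```

If column a is stored as a factor with all 6 levels, table() keeps a column for every level, including levels that have no observations in the data.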

If you want proportions within each value of a (column proportions, hence margin=2), you can use

ptab = prop.table(tab,margin=2)

and, to round to 2 decimal places,

rtab = round(ptab,2)

resulting in

             a    b    c
  (0,4]   0.75 0.33 0.00
  (4,6]   0.00 0.33 0.00
  (6,100] 0.25 0.33 1.00

Finally, if you want to convert to percent, use the formattable library

library(formattable)
prtab = apply(rtab,1:2,percent,digits=0)

          a     b     c     
  (0,4]   "75%" "33%" "0%"  
  (4,6]   "0%"  "33%" "0%"  
  (6,100] "25%" "33%" "100%"

You can control the precision with the digits argument.
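
If you would rather not add a package dependency, the same kind of percent labels can probably be produced in base R with sprintf(). This is a sketch of an alternative, not what the formattable call above does internally; the name pct is an illustrative choice:

```r
# Rebuild the sample data and the column proportions so this block runs on its own.
a <- c("a", "a", "b", "a", "b", "b", "c", "a")
b <- c(1, 3, 4, 3, 5, 7, 8, 9)
trial <- data.frame(a, b)

tab <- table(cut(trial$b, breaks = c(0, 4, 6, 100)), trial$a)
ptab <- prop.table(tab, margin = 2)

# sprintf() returns a plain character vector, so wrap it back into a matrix
# with the original row and column labels. "%%" prints a literal percent sign,
# and ".0f" sets the number of decimals (change it to ".1f" for one decimal).
pct <- matrix(sprintf("%.0f%%", 100 * as.vector(ptab)),
              nrow = nrow(ptab),
              dimnames = dimnames(ptab))
pct
```

Formatting from ptab rather than from the rounded rtab avoids rounding twice.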