
R get frequency distribution by a categorical or factor column


I have data like the sample below. To find a frequency distribution I can use hist() as shown, then read histz$breaks and histz$counts to get the number of observations that fall within each range.

I would like to get the distribution of column b for each value in column a. In my real data, column a will have 6 distinct values.

My expected output is a data frame which would have

  • 1st column - break value
  • 2nd column - counts of b values falling in each range defined by the breaks, for rows where column a of trial equals "a"
  • 3rd column - the same counts for rows where column a of trial equals "b"
  • 4th to 7th column - the same logic for the remaining values of column a

My data

a=c("a","a","b","a","b","b","c","a")

b=c(1,3,4,3,5,7,8,9)

trial=data.frame(a,b)

histz=hist(trial$b, breaks=c(0,4,6,100),plot=FALSE)

histz

Solution

  • You can use cut() to assign each value of b to a range, then table() to count the values in each range for each value of a. With your example data:

    tab = table(cut(trial$b,breaks=c(0,4,6,100)),trial$a)
    

    This produces

              a b c
      (0,4]   3 1 0
      (4,6]   0 1 0
      (6,100] 1 1 1
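
    The question asks for a data frame with the break ranges in the first column, whereas table() returns a table object. A minimal sketch of one way to convert it, assuming base R only; the names `brks`, `counts`, `result`, and the column name `bin` are illustrative and not part of the original answer:

    ```r
    # Data from the question
    a <- c("a", "a", "b", "a", "b", "b", "c", "a")
    b <- c(1, 3, 4, 3, 5, 7, 8, 9)
    trial <- data.frame(a, b)

    brks <- c(0, 4, 6, 100)
    tab <- table(cut(trial$b, breaks = brks), trial$a)

    # as.data.frame.matrix() keeps the wide layout (one column per value of a);
    # plain as.data.frame() on a table would return long format instead.
    counts <- as.data.frame.matrix(tab)

    # Move the range labels from the row names into a first column
    result <- data.frame(bin = rownames(counts), counts, row.names = NULL)
    result
    ```

    The result has one row per range and one count column per distinct value of a, so it should extend to 6 values without changes.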
    

    If you want proportions within each value of a (so that each column sums to 1), you can use

    ptab = prop.table(tab,margin=2)
    

    and, to round to 2 decimal places,

    rtab = round(ptab,2)
    

    resulting in

                 a    b    c
      (0,4]   0.75 0.33 0.00
      (4,6]   0.00 0.33 0.00
      (6,100] 0.25 0.33 1.00
    

    Finally, if you want to convert to percentages, use the formattable library

    library(formattable)
    prtab = apply(rtab,1:2,percent,digits=0)
    
              a     b     c     
      (0,4]   "75%" "33%" "0%"  
      (4,6]   "0%"  "33%" "0%"  
      (6,100] "25%" "33%" "100%"
    

    You can control the precision with the digits argument.
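
    If you would rather not add a package dependency, the same percentage strings can be built in base R. This is a sketch of an alternative, not the approach from the answer above; it assumes sprintf() formatting is acceptable, and `pct` is an illustrative name:

    ```r
    # Data and proportions as in the steps above
    a <- c("a", "a", "b", "a", "b", "b", "c", "a")
    b <- c(1, 3, 4, 3, 5, 7, 8, 9)
    tab <- table(cut(b, breaks = c(0, 4, 6, 100)), a)
    ptab <- prop.table(tab, margin = 2)

    # Format each proportion as a whole-number percentage string,
    # then restore the original dimensions and labels.
    pct <- array(sprintf("%.0f%%", 100 * as.vector(ptab)),
                 dim = dim(ptab), dimnames = dimnames(ptab))
    pct
    ```

    Changing the format string to "%.1f%%" would give one decimal place, which plays the same role as the digits argument.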