
Automatically creating bins for a numeric variable in R


So I have a variable as below.

var <- c(0L, 5L, 4L, 115L, 0L, 0L, 0L, 2L, 365L, 4L, 20L, 61L, 365L, 
0L, 365L, 0L, 14L, 0L, 0L, 72L, 0L, 0L, 6L, 105L, 150L, 0L, 365L, 
0L, 1L, 28L, 161L, 6L, 0L, 2L, 12L, 0L, 10L, 49L, 7L, 2L, 51L, 
0L, 0L, 11L, 0L, 0L, 17L, 0L, 0L, 7L, 0L, 28L, 0L, 0L, 0L, 44L, 
0L, 3L, 0L, 0L, 0L, 1L, 1L, 0L, 4L, 87L, 0L, 321L, 0L, 0L, 0L, 
0L, 9L, 0L, 0L, 0L, 140L, 0L, 0L, 0L, 0L, 0L, 1L, 8L, 20L, 0L, 
4L, 14L, 3L, 0L, 0L, 0L, 39L, 4L, 9L, 0L, 0L, 0L, 1L, 7L)

I want to create bins (of equal or different sizes, either is fine) to categorize this variable and plot it as a bar chart.

I know it's possible to find automatic/recommended bins, but I'm unsure how to do so in R.

I tried using the bin() function to no avail. I also read about the Jenks method, but is there a way to create the best possible bins in R?

Would like to use it to plot a bar plot in ggplot.


Solution

  • Your description sounds like you want to plot a histogram of var. This can be done easily enough in ggplot using geom_histogram. The key here is that ggplot likes to have a data frame, so you just have to put your variable in a data frame first, which you can do inside the ggplot() call:

    library(ggplot2)  # provides ggplot(), geom_histogram() and stat_bin()
    ggplot(data.frame(var), aes(var)) + geom_histogram(color='black', alpha=0.2)
    

    Gives you this:

    [Image: histogram of var with the default 30 bins]

    The default is to use 30 bins, but you can specify either number of bins via bins= or the size of the bins via binwidth=:

    ggplot(data.frame(var), aes(var)) + geom_histogram(bins=10, color='black', alpha=0.2)
    

    [Image: histogram of var with 10 bins]
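    The binwidth= argument mentioned above can be sketched as follows, along with one way to derive a bin count from a rule of thumb instead of picking it by hand. This is only an illustration: x is a short made-up stand-in for var, the width of 30 is arbitrary, and nclass.Sturges() and nclass.FD() (from base R's grDevices package) are common heuristics, not a guarantee of the "best" bins.

```r
library(ggplot2)

# Made-up stand-in for the question's `var` (same 0-365 range, far fewer values)
x <- c(0, 0, 0, 1, 2, 4, 5, 7, 14, 20, 28, 61, 105, 150, 365)
df <- data.frame(x)

# binwidth= sets the bin width in data units; boundary = 0 puts a bin edge at 0
p_width <- ggplot(df, aes(x)) +
  geom_histogram(binwidth = 30, boundary = 0, color = "black", alpha = 0.2)

# Rule-of-thumb bin counts from grDevices (attached by default in R)
n_sturges <- nclass.Sturges(x)  # Sturges' formula
n_fd <- nclass.FD(x)            # Freedman-Diaconis rule

# Pass a rule-derived count to bins= instead of a hand-picked number
p_rule <- ggplot(df, aes(x)) +
  geom_histogram(bins = n_fd, color = "black", alpha = 0.2)

p_width
p_rule
```

    With data as skewed as var (mostly zeros, a few values at 365), the different rules can suggest quite different counts, so it is worth comparing the resulting plots rather than trusting any single one.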

    If you want the basic bar geom, then geom_histogram() works just fine. If you switch to the stat_bin() function instead, it performs the same binning, but you can then draw the result with a different geom if you want to:

    ggplot(data.frame(var), aes(var)) +
      stat_bin(geom='area', bins=10, alpha=0.2, color='black')
    

    [Image: area plot of var binned into 10 bins]
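    Because stat_bin() does the binning itself, the counts it computed can also be read back from the plot. A minimal sketch, assuming ggplot2's layer_data() helper and using a made-up stand-in x for var; the default bar geom is used here so the xmin/xmax columns are the bin edges:

```r
library(ggplot2)

# Made-up stand-in for the question's `var`
x <- c(0, 0, 0, 1, 2, 4, 5, 7, 14, 20, 28, 61, 105, 150, 365)

# stat_bin() with its default geom ("bar") is equivalent to geom_histogram()
p <- ggplot(data.frame(x), aes(x)) + stat_bin(bins = 10)

# layer_data() returns the data frame ggplot2 computed for the first layer
binned <- layer_data(p)

# Bin edges and the number of observations that fell into each bin
binned[, c("xmin", "xmax", "count")]
```

    Note that ggplot2 places its bin edges differently from cut(), so these ranges will not necessarily match the factor levels that cut() produces.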

    If you're looking to grab just the numbers/data from "binning" a variable like you have, one of the simplest ways might be to use cut() from base R.

    Use of cut() is pretty simple. You specify the vector and a breaks= argument. Breaks can be given as a vector of points where you want to "cut" (or "bin") your data, or you can just set breaks=10 and it will give you 10 equal-width bins spanning the range of the data. The result is a factor whose levels correspond to the range of each bin. In the case of var with breaks=10, you get the following:

    > var_cut <- cut(var, breaks = 10)
    > levels(var_cut)
     [1] "(-0.365,36.5]" "(36.5,73]"     "(73,110]"      "(110,146]"     "(146,182]"     "(182,219]"     "(219,256]"    
     [8] "(256,292]"     "(292,328]"     "(328,365]"
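    To get from cut() to the bar chart the question asks for, one option is to pass the resulting factor to geom_bar(), which counts the observations in each level. The sketch below also shows unequal bin sizes; the break points are hand-picked for illustration only, and x is a made-up stand-in for var.

```r
library(ggplot2)

# Made-up stand-in for the question's `var`
x <- c(0, 0, 0, 1, 2, 4, 5, 7, 14, 20, 28, 61, 105, 150, 365)

# Hypothetical, hand-picked break points giving bins of different widths.
# include.lowest = TRUE closes the first interval on the left so the zeros are kept.
x_cut <- cut(x, breaks = c(0, 1, 7, 30, 90, 365), include.lowest = TRUE)

# Number of observations per bin
bin_counts <- table(x_cut)
bin_counts

# cut() returns a factor, so geom_bar() draws one bar per bin;
# drop = FALSE keeps bins that happen to be empty on the axis
p <- ggplot(data.frame(x_cut), aes(x_cut)) +
  geom_bar(color = "black", alpha = 0.2) +
  scale_x_discrete(drop = FALSE) +
  labs(x = "bin", y = "count")
p
```

    Without include.lowest = TRUE, values equal to the smallest break (the zeros here) would fall outside the first interval and become NA.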