Tidyverse: Converting numerical data into categorical data for plotting with uneven bin width


Using the tidyverse, I want to discretize numerical data so that I can plot the numerical ranges in a bar chart as if they were categories. I want to declare manually where the cuts occur, as with age groups or income ranges, and the intervals should be of unequal width.

So far, I've tried the base R approach, using cut() and setting the bins with breaks = c(). I notice, however, that the ggplot2 package provides a set of functions: cut_interval, cut_width, and cut_number. I figure there is a way to set the interval cuts manually with these functions, because the breaks argument exists for the interval and number variants.

library(tidyverse)

mtcars <- as_tibble(mtcars)

mtcars %>% 
  count(cut_interval(mpg, n = 4))
#> # A tibble: 4 x 2
#>   `cut_interval(mpg, n = 4)`     n
#>   <fct>                      <int>
#> 1 [10.4,16.3]                   10
#> 2 (16.3,22.1]                   13
#> 3 (22.1,28]                      5
#> 4 (28,33.9]                      4

mtcars %>% 
  count(cut_interval(mpg, n = 4, breaks = c(10, 18, 23, 28, 35)))
#> Error: Evaluation error: lengths of 'breaks' and 'labels' differ.

Created on 2019-06-03 by the reprex package (v0.2.1)

The above is close to what I want, but it sets the breaks based on the number of intervals.

In the above example, I would like my groups to be precisely as follows:

10-18, 19-23, 24-28, 29-35.

Is this possible using the breaks argument? Thank you.


Solution

  • You can use the base R cut() function directly:

    library(tidyverse)
    
    mtcars %>% 
        mutate(bin = cut(mpg, breaks = c(10, 18, 19, 23, 24, 28, 29, 35))) %>% 
        count(bin)
    

    This gives the following; count() drops bins that contain no cars, such as (23,24] and (28,29]:

    # A tibble: 5 x 2
      bin         n
      <fct>   <int>
    1 (10,18]    13
    2 (18,19]     2
    3 (19,23]    10
    4 (24,28]     3
    5 (29,35]     4
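
Because mpg is continuous, values such as 18.7 fall between the requested groups 10-18 and 19-23, which is why the extra (18,19] bin appears above. A possible variation, sketched here under the assumption that contiguous intervals are acceptable, passes only the outer edges of each group to cut() and attaches readable labels. The label strings and the names `binned` and `bin_counts` are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(ggplot2)

# Assumption: contiguous breaks, so every mpg value lands in exactly one group.
# cut() intervals are right-closed by default: (10,18], (18,23], (23,28], (28,35].
binned <- mtcars %>%
  mutate(bin = cut(mpg,
                   breaks = c(10, 18, 23, 28, 35),
                   labels = c("10-18", "18-23", "23-28", "28-35")))

# Count the cars in each group
bin_counts <- count(binned, bin)
bin_counts

# Bar chart that treats the bins as categories
ggplot(binned, aes(x = bin)) +
  geom_bar() +
  labs(x = "mpg group", y = "Number of cars")
```

With mtcars, the two middle bins of the answer above merge into one, so the counts should come out as 13, 12, 3, and 4. The breaks vector can hold any increasing sequence of edges, which is what allows the uneven bin widths.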