Using the tidyverse, I want to discretize numerical data by manually declaring where the cuts occur (as with age groups or income ranges), so that I can plot the resulting ranges in a bar chart as if the data were categorical. The intervals need to be of unequal width.
So far, I've tried the base R approach, using cut() and setting the bins with breaks = c(). I notice, however, that there exists a set of functions cut_interval, cut_width, and cut_number in the ggplot2 package. I figure that there's a way to manually set the interval cuts using these functions, because the breaks argument exists for the interval and number variants.
library(tidyverse)
mtcars <- as_tibble(mtcars)
mtcars %>%
  count(cut_interval(mpg, n = 4))
#> # A tibble: 4 x 2
#>   `cut_interval(mpg, n = 4)`     n
#>   <fct>                      <int>
#> 1 [10.4,16.3]                   10
#> 2 (16.3,22.1]                   13
#> 3 (22.1,28]                      5
#> 4 (28,33.9]                      4
mtcars %>%
  count(cut_interval(mpg, n = 4, breaks = c(10, 18, 23, 28, 35)))
#> Error: Evaluation error: lengths of 'breaks' and 'labels' differ.
Created on 2019-06-03 by the reprex package (v0.2.1)
The above is close to what I want, but it sets the breaks based on the number of intervals.
In the above example, I would like my groups to be precisely as follows:
10-18, 19-23, 24-28, 29-35.
Is this possible using the breaks argument? Thank you.
You can just use the actual base cut function to do this:
library(tidyverse)
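mtcars %>%
  # cut() builds right-closed bins (lower, upper] from consecutive breaks
  mutate(bin = cut(mpg, breaks = c(10, 18, 19, 23, 24, 28, 29, 35))) %>%
  # count() drops bins with no observations, such as (23,24] and (28,29]
  count(bin)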
Which will give you:
# A tibble: 5 x 2
  bin         n
  <fct>   <int>
1 (10,18]    13
2 (18,19]     2
3 (19,23]    10
4 (24,28]     3
5 (29,35]     4
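If you want exactly the four groups from the question, one possible variation is to pass only the shared boundaries as breaks and name the bins with the labels argument of cut(). The sketch below assumes it is acceptable for non-integer values such as 18.7 to fall into the next group up, since cut() treats each bin as (lower, upper] by default. The names mpg_breaks, mpg_labels, and binned are only illustrative.

```r
library(tidyverse)

# Five boundaries define four bins: (10,18], (18,23], (23,28], (28,35]
mpg_breaks <- c(10, 18, 23, 28, 35)

# One label per bin, written the way the question lists the groups
mpg_labels <- c("10-18", "19-23", "24-28", "29-35")

binned <- mtcars %>%
  mutate(bin = cut(mpg, breaks = mpg_breaks, labels = mpg_labels)) %>%
  count(bin)

binned

# Plot the bins as categories; geom_col() uses the precomputed counts
ggplot(binned, aes(x = bin, y = n)) +
  geom_col()
```

With mtcars this should give counts of 13, 12, 3, and 4 for the four groups. Note that labels must contain one fewer element than breaks, otherwise cut() raises the same "lengths of 'breaks' and 'labels' differ" error shown in the question.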