I'd like to bin a continuous numeric variable while keeping it numeric. There are several ways to discretize or categorize a continuous variable into a factor variable, like this:
data(mtcars)
library(tidyverse)
mtcars <- mtcars %>% mutate(mpg_binned = cut_width(mpg, 2, closed = "right", boundary = 10))
as_tibble(mtcars %>% select(mpg, mpg_binned))
# A tibble: 32 × 2
mpg mpg_binned
<dbl> <fct>
1 21 (20,22]
2 21 (20,22]
3 22.8 (22,24]
4 21.4 (20,22]
5 18.7 (18,20]
6 18.1 (18,20]
7 14.3 (14,16]
8 24.4 (24,26]
9 22.8 (22,24]
10 19.2 (18,20]
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
But I'd like to make various graphs and do further operations on numeric values later on. So I'd like to convert each original value to the center of its interval. The first observation stays 21, since that is the midpoint of (20,22]. Rounding does not work, because the value 14.3 in row 7 should become 15 (the midpoint of (14,16]).
You could extract the interval endpoints from the mpg_binned column and take their average, with something like:
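# Pull both endpoints out of each label such as "(14,16]" and average them.
# The pattern also accepts decimal and negative endpoints, e.g. "(-0.5,0]".
mtcars$mid <- sapply(stringr::str_extract_all(mtcars$mpg_binned, "-?[0-9]+\\.?[0-9]*"),
                     function(x) mean(as.numeric(x)))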
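If you would rather not parse the factor labels at all, the midpoint can also be computed arithmetically from the bin width and boundary. The following is only a sketch: bin_midpoint() is a hypothetical helper written for this answer, not part of any package, and it assumes right-closed bins of width 2 aligned on 10, matching the cut_width() call in the question.

```r
# Hypothetical helper: midpoint of the right-closed bin
# (boundary + (k - 1) * width, boundary + k * width] that contains x.
bin_midpoint <- function(x, width = 2, boundary = 10) {
  # ceiling() assigns a value lying exactly on an edge to the bin it closes,
  # e.g. 26 falls into (24,26], not (26,28]
  k <- ceiling((x - boundary) / width)
  boundary + (k - 0.5) * width
}

data(mtcars)
mtcars$mpg_mid <- bin_midpoint(mtcars$mpg)
head(mtcars[, c("mpg", "mpg_mid")], 8)
```

With this, 14.3 maps to 15 and 21 stays 21, as requested. Two caveats: a value exactly equal to the lowest bin edge may be handled differently than cut_width() handles it, and for non-integer widths floating-point error can shift values that sit right on an edge, so it is worth comparing the result against the string-based column before relying on it.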