r, tidyverse, data-manipulation

How to bin a continuous variable while keeping it numeric in R?


I would like to bin a continuous numeric variable while keeping it numeric. There are several options to discretize or categorize a continuous variable into a factor variable, like this:

data(mtcars)

library(tidyverse)

mtcars <- mtcars %>% mutate(mpg_binned = cut_width(mpg, 2, closed = "right", boundary = 10))
as_tibble(mtcars %>% select(mpg, mpg_binned))

# A tibble: 32 × 2
     mpg mpg_binned
   <dbl> <fct>     
 1  21   (20,22]   
 2  21   (20,22]   
 3  22.8 (22,24]   
 4  21.4 (20,22]   
 5  18.7 (18,20]   
 6  18.1 (18,20]   
 7  14.3 (14,16]   
 8  24.4 (24,26]   
 9  22.8 (22,24]   
10  19.2 (18,20]   
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows

But I would like to produce various graphs and perform further operations that need numeric values. So I would like to convert each original value to the center of its interval. The first observation remains 21, since that is the middle of (20,22]. Rounding does not work, because the value 14.3 in row 7 should become 15 (the middle of (14,16]).


Solution

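  • You could extract the numeric boundaries from each mpg_binned label and take their average, with something like:

    # Pull the two boundaries out of each label (e.g. "(14,16]") and average them.
    # The pattern also matches negative and decimal boundaries such as "(-1.5,0.5]".
    mtcars$mid <- sapply(stringr::str_extract_all(as.character(mtcars$mpg_binned), "-?[0-9]+\\.?[0-9]*"),
                         function(x) mean(as.numeric(x)))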
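
As an alternative to parsing the factor labels, the midpoint can be computed arithmetically from the bin width and boundary. The following is a minimal sketch in base R, not part of the original answer: `bin_midpoint` is a hypothetical helper name, and it assumes equal-width, right-closed bins aligned on `boundary`, matching the `cut_width(mpg, 2, closed = "right", boundary = 10)` call in the question.

```r
# Hypothetical helper (an assumption, not from the original answer):
# returns the midpoint of the right-closed bin (lower, upper] of the
# given width, aligned on `boundary`, that contains each value of x.
bin_midpoint <- function(x, width, boundary = 0) {
  # Number of widths from `boundary` up to the bin's upper edge
  k <- ceiling((x - boundary) / width)
  # Upper edge minus half a width gives the bin center
  boundary + k * width - width / 2
}

data(mtcars)
mtcars$mpg_mid <- bin_midpoint(mtcars$mpg, width = 2, boundary = 10)
head(mtcars[, c("mpg", "mpg_mid")])
```

With these settings 21 stays 21 and 14.3 becomes 15, as asked in the question. One caveat: a value exactly equal to the lowest break may be placed in the first bin by `cut_width()`, whereas this helper would assign it to the bin below, so the two approaches can differ at that edge.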