
R cut Pretty-print Values Beyond Boundaries


Is there a function in R that pretty-prints a numeric vector converted into a factor when some values fall outside the breaks? The desired input and output are:

data <- seq(5, 95, 10)
result <- cutSpecial(data, breaks = c(30, 40, 50, 60, 70))
disc <- c("<30", "<30", "<30", "[30, 40)", "[40, 50)", "[50, 60)", "[60, 70)",
          ">70", ">70", ">70")
cbind(data, disc)
     data disc      
 [1,] "5"  "<30"     
 [2,] "15" "<30"     
 [3,] "25" "<30"     
 [4,] "35" "[30, 40)"
 [5,] "45" "[40, 50)"
 [6,] "55" "[50, 60)"
 [7,] "65" "[60, 70)"
 [8,] "75" ">70"     
 [9,] "85" ">70"     
[10,] "95" ">70"     

The base R cut() function simply turns values outside the range of the breaks into NA, which is unsatisfying. Which function in the R ecosystem would play the role of cutSpecial?
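
For reference, the NA behaviour can be reproduced with a minimal sketch like the one below. It assumes left-closed intervals via right = FALSE (an assumption, chosen to match the "[30, 40)" style of the desired labels); the variable name binned is only illustrative.

```r
data <- seq(5, 95, 10)

# right = FALSE makes the intervals left-closed, e.g. [30,40)
binned <- cut(data, breaks = c(30, 40, 50, 60, 70), right = FALSE)

# Values below 30 or above 70 have no interval and come back as NA
print(binned)
print(is.na(binned))
```

The three values below 30 and the three above 70 fall in no interval, so six of the ten results are NA.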


Solution

  • It would be chop() from my santoku package:

    library(santoku)
    data <- seq(5, 95, 10)
    chop(data, c(30, 40, 50, 60, 70))
    ##  [1] [5, 30)  [5, 30)  [5, 30)  [30, 40) [40, 50) [50, 60) [60, 70) [70, 95] [70, 95]
    ## [10] [70, 95]
    ## Levels: [5, 30) [30, 40) [40, 50) [50, 60) [60, 70) [70, 95]
    

    If you want specific labels you can either pass them in yourself:

    chop(data, c(30, 40, 50, 60, 70),
         labels = c("< 30", "[30-40)", "[40-50)", "[50-60)", "[60-70)", ">= 70"))
    

    Or in the latest version, you can use lbl_dash() and specify first and last:

    chop(data, c(30, 40, 50, 60, 70), labels = lbl_dash(first = "< 30", last = ">= 70"))
    ##  [1] < 30    < 30    < 30    30 - 40 40 - 50 50 - 60 60 - 70 >= 70   >= 70   >= 70  
    ## Levels: < 30 30 - 40 40 - 50 50 - 60 60 - 70 >= 70
    

    There's no such argument for the default interval labels, but maybe there should be.
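
If adding a package is not an option, something close to the requested cutSpecial can probably be sketched in base R by padding the breaks with -Inf and Inf and building the labels by hand. This is only a sketch: cutSpecial here is a hypothetical helper (it exists in neither base R nor santoku), it assumes breaks is sorted in ascending order, and it labels the top bin ">=70" rather than ">70" because a value of exactly 70 lands there.

```r
# Hypothetical helper, not part of base R or santoku.
# Assumes `breaks` is sorted in ascending order.
cutSpecial <- function(x, breaks) {
  # Labels for the inner, left-closed intervals, e.g. "[30, 40)"
  inner <- sprintf("[%s, %s)", head(breaks, -1), tail(breaks, -1))
  labels <- c(
    paste0("<", breaks[1]),
    inner,
    paste0(">=", breaks[length(breaks)])
  )
  # Padding with -Inf and Inf gives every value a bin, so nothing becomes NA
  cut(x, breaks = c(-Inf, breaks, Inf), labels = labels, right = FALSE)
}

data <- seq(5, 95, 10)
result <- cutSpecial(data, breaks = c(30, 40, 50, 60, 70))
print(cbind(data, disc = as.character(result)))
```

The trade-off against chop() is that the label format is hard-coded in the helper, whereas santoku lets you swap labelling functions such as lbl_dash().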