r quantile percentile

How to break my sample into uneven categories?


We usually use quartiles, quantiles, or ntiles to split a sample. We can also use the function cut.

I have a numeric variable that I would like to split into three categories, but the categories should not be evenly sized. A quartile split, for example, divides the sample into four equal-sized groups: the 0 to 25th, 26th to 50th, 51st to 75th, and 76th to 100th percentiles. So the first three functions I mentioned cannot do the job. cut can probably split the variable, but I don't know how to specify the breaks in terms of percentiles. I would like to create a variable that splits the sample into the 0 to 20th percentile, the 21st to 60th, and the 61st to 100th.

Here is a reproducible example:

    library(dplyr)

    set.seed(1)
    df <- tibble(
      V1 = round(runif(1000, min = 1, max = 1000)),
      V2 = round(runif(1000, min = 1, max = 3)),
      V3 = round(runif(1000, min = 1, max = 10))
    )

    df$V2 <- as.factor(df$V2)
    df$V3 <- as.factor(df$V3)

    df <- df %>%
      group_by(V2, V3) %>%
      mutate(quartile = ntile(V1, 4))

Solution

  • I'm not 100% sure this is what you're looking for, and I'll admit it's not the most elegant code ever written, but would something like this do what you want?

    # Row positions of the 20th and 60th percentile cut-offs in the whole sample
    # (this is just based on googling how to calculate percentiles)
    cut.20 <- 20 / 100 * nrow(df)
    cut.60 <- 60 / 100 * nrow(df)

    df <- df %>%
      ungroup() %>%                      # df is still grouped by V2, V3 from the question
      arrange(V1) %>%
      mutate(index = row_number()) %>%   # position of each row after sorting by V1
      group_by(V2, V3) %>%
      mutate(centile = case_when(
        index <= cut.20 ~ "0-20",
        index <= cut.60 ~ "21-60",
        TRUE            ~ "61-100"
      ))

    Note that index ranks rows across the whole sample, so these bins are percentiles of the full sample rather than of each V2/V3 group.
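Another way to get there, offered as a sketch rather than a definitive answer, is to give cut the cumulative proportion of each value instead of the raw values. It assumes that dplyr's cume_dist (the share of values less than or equal to the current one) is an acceptable stand-in for "percentile". The column name centile and the labels are only illustrative. Calling it after group_by computes the bins within each V2/V3 group, which the ntile example in the question suggests is the goal; drop the group_by for whole-sample bins.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(1)
df <- tibble(
  V1 = round(runif(1000, min = 1, max = 1000)),
  V2 = factor(round(runif(1000, min = 1, max = 3))),
  V3 = factor(round(runif(1000, min = 1, max = 10)))
)

df <- df %>%
  group_by(V2, V3) %>%
  mutate(
    # cume_dist() lies in (0, 1], so the right-closed intervals
    # (0, 0.2], (0.2, 0.6], (0.6, 1] cover every row
    centile = cut(
      cume_dist(V1),
      breaks = c(0, 0.2, 0.6, 1),
      labels = c("0-20", "21-60", "61-100")  # illustrative labels
    )
  ) %>%
  ungroup()

# Number of rows that landed in each bin
table(df$centile)
```

Passing quantile(V1, probs = c(0, 0.2, 0.6, 1)) as the breaks to cut on the raw values, with include.lowest = TRUE, is a common alternative, but cut stops with an error when the breaks are not unique, which can happen in small or heavily tied groups.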