r, dataframe, cut

Why are decile values incorrect when using the cut function?


I tried to attach a decile value to each observation using the code below. However, the values do not seem to be correct. What could be the reason?

        df <- read.table(text = "pregnant glucose blood skin INSULIN MASS  DIAB AGE CLASS  predict_probability
                                  1     106    70   28     135 34.2 0.142  22     0       0.15316285
                                  1      91    54   25     100 25.2 0.234  23     0       0.05613959
                                  4     136    70    0       0 31.2 1.182  22     1       0.54034794
                                  9     164    78    0       0 32.8 0.148  45     1       0.64361578
                                  3     173    78   39     185 33.8 0.970  31     1       0.79185196
                                 11     136    84   35     130 28.3 0.260  42     1       0.31927737
                                  0     141    84   26       0 32.4 0.433  22     0       0.41609308
                                  3     106    72    0       0 25.8 0.207  27     0       0.10460090
                                  9     145    80   46     130 37.9 0.637  40     1       0.67061324
                                 10     111    70   27       0 27.5 0.141  40     1       0.16152296
                        ", header = TRUE)

        deciles <- cut(df$predict_probability,
                       breaks = quantile(df$predict_probability, probs = seq(0, 1, by = 0.10)),
                       labels = 1:10, include.lowest = TRUE)
        df1 <- cbind(df, deciles)
        head(df1, 10)
           pregnant glucose blood skin INSULIN MASS  DIAB AGE CLASS predict_probability deciles
        1         1     106    70   28     135 34.2 0.142  22     0          0.15316285       3
        2         1      91    54   25     100 25.2 0.234  23     0          0.05613959       1
        3         4     136    70    0       0 31.2 1.182  22     1          0.54034794       7
        4         9     164    78    0       0 32.8 0.148  45     1          0.64361578       8
        5         3     173    78   39     185 33.8 0.970  31     1          0.79185196      10
        6        11     136    84   35     130 28.3 0.260  42     1          0.31927737       5
        7         0     141    84   26       0 32.4 0.433  22     0          0.41609308       6
        8         3     106    72    0       0 25.8 0.207  27     0          0.10460090       2
        9         9     145    80   46     130 37.9 0.637  40     1          0.67061324       9
        10       10     111    70   27       0 27.5 0.141  40     1          0.16152296       4

Solution

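  • Per Dason's proposal, here is the full answer to the question. With quantile() supplying the breaks, the bin edges are computed from the data, so the ten observations are split by rank and each one gets its own label from 1 to 10. If the label should instead reflect which 0.1-wide probability band the value falls in, remove the quantile() call and pass seq(0,1,by=0.1) directly to cut() as the breaks.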

        deciles <- cut(df$predict_probability, breaks = seq(0, 1, by = 0.1),
                       labels = 1:10, include.lowest = TRUE)
        df1 <- cbind(df, deciles)
        head(df1, 10)
           pregnant glucose blood skin INSULIN MASS  DIAB AGE CLASS predict_probability deciles
        1         1     106    70   28     135 34.2 0.142  22     0          0.15316285       2
        2         1      91    54   25     100 25.2 0.234  23     0          0.05613959       1
        3         4     136    70    0       0 31.2 1.182  22     1          0.54034794       6
        4         9     164    78    0       0 32.8 0.148  45     1          0.64361578       7
        5         3     173    78   39     185 33.8 0.970  31     1          0.79185196       8
        6        11     136    84   35     130 28.3 0.260  42     1          0.31927737       4
        7         0     141    84   26       0 32.4 0.433  22     0          0.41609308       5
        8         3     106    72    0       0 25.8 0.207  27     0          0.10460090       2
        9         9     145    80   46     130 37.9 0.637  40     1          0.67061324       7
        10       10     111    70   27       0 27.5 0.141  40     1          0.16152296       2
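
    To see why the two calls disagree, it may help to run both on the same vector and compare the labels side by side. The following is a minimal sketch that reuses the ten predict_probability values from the question; the names p, by_rank, by_value and comparison are made up for illustration and are not part of the original code.

```r
# The ten predicted probabilities from the question.
p <- c(0.15316285, 0.05613959, 0.54034794, 0.64361578, 0.79185196,
       0.31927737, 0.41609308, 0.10460090, 0.67061324, 0.16152296)

# Quantile-based breaks: the edges are computed from the data, so each bin
# holds roughly the same number of observations (here, one per bin).
by_rank <- cut(p, breaks = quantile(p, probs = seq(0, 1, by = 0.1)),
               labels = 1:10, include.lowest = TRUE)

# Fixed breaks: the edges are 0, 0.1, ..., 1 regardless of the data, so the
# label only says which 0.1-wide band the probability falls in.
by_value <- cut(p, breaks = seq(0, 1, by = 0.1),
                labels = 1:10, include.lowest = TRUE)

# Side-by-side view, sorted by probability.
comparison <- data.frame(p, by_rank, by_value)
print(comparison[order(p), ])

# Omitting `labels` shows the interval each value was assigned to.
print(table(cut(p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)))
```

    With ten distinct values, by_rank is simply the rank of each probability, while by_value repeats a label whenever several probabilities share a band (for example, three of them lie between 0.1 and 0.2). Which of the two is "correct" depends on whether the label is meant to describe the position within the sample or the size of the predicted probability.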