Search code examples
rbinning

Create manual Bin after condition is satisfied


Given a dataframe df that looks like this after using a cut:

      X  Binned_X Dependent     Importance
1  -236      -236         0   -0.25131443
2  -236      -236         1   -0.25131443
3  -236      -236         1   -0.25131443
4  -236      -236         1   -0.25131443
5  -236      -236         0   -0.25131443
6  -236      -236         1   -0.25131443
7  -236      -236         0   -0.25131443
8   320 [244,485]         1    0.06713930
9   320 [244,485]         0    0.06713930
10  320 [244,485]         1    0.06713930
11  320 [244,485]         1    0.06713930
12  244 [244,485]         1    0.06713930
13  244 [244,485]         1    0.06713930
14  244 [244,485]         0    0.06713930
15  244 [244,485]         1    0.06713930
16  485 [244,485]         0    0.06713930
17  485 [244,485]         1    0.06713930
18  485 [244,485]         1    0.06713930
19  485 [244,485]         1    0.06713930
20  485 [244,485]         1    0.06713930
21  485 [244,485]         0    0.06713930
22  485 [244,485]         0    0.06713930
23  485 [244,485]         0    0.06713930
24  485 [244,485]         1    0.06713930
25   25  [25,244)         1    0.04879016
26   25  [25,244)         1    0.04879016
27   25  [25,244)         1    0.04879016
28   25  [25,244)         1    0.04879016
29   25  [25,244)         0    0.04879016
30   25  [25,244)         1    0.04879016
31   25  [25,244)         1    0.04879016
32  108  [25,244)         1    0.04879016
33  108  [25,244)         0    0.04879016
34  108  [25,244)         0    0.04879016
35  108  [25,244)         0    0.04879016
36  108  [25,244)         1    0.04879016
37  108  [25,244)         1    0.04879016
38  108  [25,244)         0    0.04879016

I would like to replace the column "Binned_X" if the difference of the value in the column "Importance" between bins is lower than N (let's say 0.2)

It is important to say that since X is a continuous variable, if we want to merge some bins, they need to be right next in order (example, if we have bins "1, 2 and 3", "1" could only merge with "2", "2" could merge with "1" or "3", and "3" could only merge with "2"). So in this particular example, bin "-236" could only merge with "[25,244)", while bin "[25,244)" could merge with "-236" and "[244,485]" and so on.

The desired output would be something like this:

      X  Binned_X
1  -236      -236
2  -236      -236
3  -236      -236
4  -236      -236
5  -236      -236
6  -236      -236
7  -236      -236
8   320  [25,485]
9   320  [25,485]
10  320  [25,485]
11  320  [25,485]
12  244  [25,485]
13  244  [25,485]
14  244  [25,485]
15  244  [25,485]
16  485  [25,485]
17  485  [25,485]
18  485  [25,485]
19  485  [25,485]
20  485  [25,485]
21  485  [25,485]
22  485  [25,485]
23  485  [25,485]
24  485  [25,485]
25   25  [25,485)
26   25  [25,485)
27   25  [25,485)
28   25  [25,485)
29   25  [25,485)
30   25  [25,485)
31   25  [25,485)
32  108  [25,485)
33  108  [25,485)
34  108  [25,485)
35  108  [25,485)
36  108  [25,485)
37  108  [25,485)
38  108  [25,485)

Since the classes for each column appears to be like this:

> lapply(df_Ex_binned,class)
$`X`
[1] "numeric"

$Binned_X
[1] "ordered" "factor" 

$Dependent
[1] "numeric"

My approach would be to extract the opening number of the column "Binned_X" (the number between the "(" or "[" and the ",") so I could change it to numeric and then order it (for the purpose of the continuous variable I said before).

Then, I would do a For loop comparing the value in column "Importance" and when there is a difference lower than 0.2, get the value before the "," (for opening value) of the previous row, and the value after the "," (for closing value) an set it to all rows that contain any of those 2 values in "Importance" that was subtracted.

So, as in this approach I would be using 2 For Loops and I know in R that would be a really inefficient way...

I would really appreciate any suggestions


Solution

  • I saw on another question that you used pakage cutr so i'll use it there too, adding solutions with base::cut and Hmisc::cut2 in the end (they give a slightly different output).

    # devtools::install_github("moodymudskipper/cutr")
    library(cutr)
    threshold <- .02
    cutpoints <- with(df[order(df$X),], X[c(TRUE,abs(diff(Importance))>threshold)])
    transform(df,Binned_X = smart_cut(X,cutpoints))
    #       X Binned_X Dependent  Importance
    # 1  -236     -236         0 -0.25131443
    # 2  -236     -236         1 -0.25131443
    # 3  -236     -236         1 -0.25131443
    # 4  -236     -236         1 -0.25131443
    # 5  -236     -236         0 -0.25131443
    # 6  -236     -236         1 -0.25131443
    # 7  -236     -236         0 -0.25131443
    # 8   320 [25,485]         1  0.06713930
    # 9   320 [25,485]         0  0.06713930
    # 10  320 [25,485]         1  0.06713930
    # 11  320 [25,485]         1  0.06713930
    # 12  244 [25,485]         1  0.06713930
    # 13  244 [25,485]         1  0.06713930
    # 14  244 [25,485]         0  0.06713930
    # 15  244 [25,485]         1  0.06713930
    # 16  485 [25,485]         0  0.06713930
    # 17  485 [25,485]         1  0.06713930
    # 18  485 [25,485]         1  0.06713930
    # 19  485 [25,485]         1  0.06713930
    # 20  485 [25,485]         1  0.06713930
    # 21  485 [25,485]         0  0.06713930
    # 22  485 [25,485]         0  0.06713930
    # 23  485 [25,485]         0  0.06713930
    # 24  485 [25,485]         1  0.06713930
    # 25   25 [25,485]         1  0.04879016
    # 26   25 [25,485]         1  0.04879016
    # 27   25 [25,485]         1  0.04879016
    # 28   25 [25,485]         1  0.04879016
    # 29   25 [25,485]         0  0.04879016
    # 30   25 [25,485]         1  0.04879016
    # 31   25 [25,485]         1  0.04879016
    # 32  108 [25,485]         1  0.04879016
    # 33  108 [25,485]         0  0.04879016
    # 34  108 [25,485]         0  0.04879016
    

    Using base::cut :

    cutpoints <- with(df[order(df$X),], c(X[c(TRUE,abs(diff(Importance))>threshold)],max(X)))
    transform(df,Binned_X = cut(X,cutpoints,include.lowest = TRUE,right = FALSE))
    

    Using Hmisc::cut2 :

    library(Hmisc)
    cutpoints <- with(df[order(df$X),], X[c(TRUE,abs(diff(Importance))>threshold)])
    transform(df,Binned_X = cut2(X,cutpoints))