Given a dataframe df that looks like this after using a cut:
X Binned_X Dependent Importance
1 -236 -236 0 -0.25131443
2 -236 -236 1 -0.25131443
3 -236 -236 1 -0.25131443
4 -236 -236 1 -0.25131443
5 -236 -236 0 -0.25131443
6 -236 -236 1 -0.25131443
7 -236 -236 0 -0.25131443
8 320 [244,485] 1 0.06713930
9 320 [244,485] 0 0.06713930
10 320 [244,485] 1 0.06713930
11 320 [244,485] 1 0.06713930
12 244 [244,485] 1 0.06713930
13 244 [244,485] 1 0.06713930
14 244 [244,485] 0 0.06713930
15 244 [244,485] 1 0.06713930
16 485 [244,485] 0 0.06713930
17 485 [244,485] 1 0.06713930
18 485 [244,485] 1 0.06713930
19 485 [244,485] 1 0.06713930
20 485 [244,485] 1 0.06713930
21 485 [244,485] 0 0.06713930
22 485 [244,485] 0 0.06713930
23 485 [244,485] 0 0.06713930
24 485 [244,485] 1 0.06713930
25 25 [25,244) 1 0.04879016
26 25 [25,244) 1 0.04879016
27 25 [25,244) 1 0.04879016
28 25 [25,244) 1 0.04879016
29 25 [25,244) 0 0.04879016
30 25 [25,244) 1 0.04879016
31 25 [25,244) 1 0.04879016
32 108 [25,244) 1 0.04879016
33 108 [25,244) 0 0.04879016
34 108 [25,244) 0 0.04879016
35 108 [25,244) 0 0.04879016
36 108 [25,244) 1 0.04879016
37 108 [25,244) 1 0.04879016
38 108 [25,244) 0 0.04879016
I would like to replace the column "Binned_X" if the difference of the value in the column "Importance" between bins is lower than N (let's say 0.2)
It is important to say that since X is a continuous variable, if we want to merge some bins, they need to be right next in order (example, if we have bins "1, 2 and 3", "1" could only merge with "2", "2" could merge with "1" or "3", and "3" could only merge with "2"). So in this particular example, bin "-236" could only merge with "[25,244)", while bin "[25,244)" could merge with "-236" and "[244,485]" and so on.
The desired output would be something like this:
X Binned_X
1 -236 -236
2 -236 -236
3 -236 -236
4 -236 -236
5 -236 -236
6 -236 -236
7 -236 -236
8 320 [25,485]
9 320 [25,485]
10 320 [25,485]
11 320 [25,485]
12 244 [25,485]
13 244 [25,485]
14 244 [25,485]
15 244 [25,485]
16 485 [25,485]
17 485 [25,485]
18 485 [25,485]
19 485 [25,485]
20 485 [25,485]
21 485 [25,485]
22 485 [25,485]
23 485 [25,485]
24 485 [25,485]
25 25 [25,485)
26 25 [25,485)
27 25 [25,485)
28 25 [25,485)
29 25 [25,485)
30 25 [25,485)
31 25 [25,485)
32 108 [25,485)
33 108 [25,485)
34 108 [25,485)
35 108 [25,485)
36 108 [25,485)
37 108 [25,485)
38 108 [25,485)
Since the classes for each column appears to be like this:
> lapply(df_Ex_binned,class)
$`X`
[1] "numeric"
$Binned_X
[1] "ordered" "factor"
$Dependent
[1] "numeric"
My approach would be to extract the opening number of the column "Binned_X" (the number between the "(" or "[" and the ",") so I could change it to numeric and then order it (for the purpose of the continuous variable I said before).
Then, I would do a For loop comparing the value in column "Importance" and when there is a difference lower than 0.2, get the value before the "," (for opening value) of the previous row, and the value after the "," (for closing value) an set it to all rows that contain any of those 2 values in "Importance" that was subtracted.
So, as in this approach I would be using 2 For Loops and I know in R that would be a really inefficient way...
I would really appreciate any suggestions
I saw on another question that you used pakage cutr
so i'll use it there too, adding solutions with base::cut
and Hmisc::cut2
in the end (they give a slightly different output).
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
threshold <- .02
cutpoints <- with(df[order(df$X),], X[c(TRUE,abs(diff(Importance))>threshold)])
transform(df,Binned_X = smart_cut(X,cutpoints))
# X Binned_X Dependent Importance
# 1 -236 -236 0 -0.25131443
# 2 -236 -236 1 -0.25131443
# 3 -236 -236 1 -0.25131443
# 4 -236 -236 1 -0.25131443
# 5 -236 -236 0 -0.25131443
# 6 -236 -236 1 -0.25131443
# 7 -236 -236 0 -0.25131443
# 8 320 [25,485] 1 0.06713930
# 9 320 [25,485] 0 0.06713930
# 10 320 [25,485] 1 0.06713930
# 11 320 [25,485] 1 0.06713930
# 12 244 [25,485] 1 0.06713930
# 13 244 [25,485] 1 0.06713930
# 14 244 [25,485] 0 0.06713930
# 15 244 [25,485] 1 0.06713930
# 16 485 [25,485] 0 0.06713930
# 17 485 [25,485] 1 0.06713930
# 18 485 [25,485] 1 0.06713930
# 19 485 [25,485] 1 0.06713930
# 20 485 [25,485] 1 0.06713930
# 21 485 [25,485] 0 0.06713930
# 22 485 [25,485] 0 0.06713930
# 23 485 [25,485] 0 0.06713930
# 24 485 [25,485] 1 0.06713930
# 25 25 [25,485] 1 0.04879016
# 26 25 [25,485] 1 0.04879016
# 27 25 [25,485] 1 0.04879016
# 28 25 [25,485] 1 0.04879016
# 29 25 [25,485] 0 0.04879016
# 30 25 [25,485] 1 0.04879016
# 31 25 [25,485] 1 0.04879016
# 32 108 [25,485] 1 0.04879016
# 33 108 [25,485] 0 0.04879016
# 34 108 [25,485] 0 0.04879016
Using base::cut
:
cutpoints <- with(df[order(df$X),], c(X[c(TRUE,abs(diff(Importance))>threshold)],max(X)))
transform(df,Binned_X = cut(X,cutpoints,include.lowest = TRUE,right = FALSE))
Using Hmisc::cut2
:
library(Hmisc)
cutpoints <- with(df[order(df$X),], X[c(TRUE,abs(diff(Importance))>threshold)])
transform(df,Binned_X = cut2(X,cutpoints))