python random-forest sklearn-pandas

Discretizing continuous variables for RandomForest in Sklearn


I want to use Random Forest for feature selection based on the Gini index. My dataset has a mix of numeric (continuous) and categorical (string) data. This is an example of the dataset:

Var1   Var2
198    zcROj17IEC
336    DHeTmBftjz
252.3  crIgUHSK8h
252    ZSNrjIX0Db

I know trees work on discrete (categorical) data, but does RandomForest in Sklearn require continuous numeric data to be discretized first, or can it handle it directly? For the categorical string variables I used the following to encode the strings into numeric columns of zeros and ones:

pandas.get_dummies(X['Var2'])

This works, but for the numeric variable I tried the following to discretize it:

pandas.qcut(X['Var1'], 2, retbins=True)

but I keep getting an error about non-unique bins!

Do I need to discretize? How can I do it?


Solution

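  • Random forest supports continuous variables with no problem, so you do not need to discretize them first: each tree splits a numeric feature by choosing a threshold on its values. See for example this sample.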
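
As a rough illustration of the point above, here is a minimal sketch that leaves Var1 continuous, one-hot encodes Var2 with pandas.get_dummies, and reads the impurity-based (Gini) importances from the fitted forest. The data, the three category labels, and the target y are made up for the example and are not from the question; only the column names Var1 and Var2 are. The last line shows one possible way around the non-unique bins error, assuming a pandas version in which qcut accepts the duplicates argument.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dataset in the question (assumption: the real
# data has a numeric Var1, a string Var2, and some class label y).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Var1": rng.normal(250.0, 50.0, size=n),      # continuous, left as-is
    "Var2": rng.choice(["a", "b", "c"], size=n),  # categorical strings
})
y = (X["Var1"] > 250.0).astype(int)  # hypothetical target, depends only on Var1

# One-hot encode only the string column; keep the numeric column untouched.
dummies = pd.get_dummies(X["Var2"], prefix="Var2", dtype=float)
features = pd.concat([X[["Var1"]], dummies], axis=1)

# The default split criterion for RandomForestClassifier is "gini".
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, y)

# Impurity-based importances, one per input column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))

# Optional: if binning is still wanted, duplicates="drop" merges repeated
# bin edges instead of raising the non-unique bins error.
binned, edges = pd.qcut(X["Var1"], 2, retbins=True, duplicates="drop")
```

Note that with one-hot encoding the importance of a categorical variable is spread across its dummy columns, so you may want to sum them per original variable before comparing.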