While using H2O DAI to build models, I noticed that the final model sometimes contains correlated variables. For instance, "max number of saving accounts in past 9 months" and "max number of saving accounts in past 3 months" both show up in the final model, even though they are highly correlated. I understand there are ways to check this before feeding the data to H2O DAI, but I am wondering whether there is a setting, or some other good way, to have H2O DAI check for multicollinearity automatically while it selects features to build models?
Thanks for the help in advance.
If you want to look at correlated features and manually remove them before building a model, go to the Autoviz section and look at the Correlated Scatterplots, then drop those columns from the experiment or dataset.
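If you would rather run the same kind of check in code before uploading the data, here is a minimal sketch using only the Python standard library. The column names, the sample values, and the 0.9 threshold are all made up for illustration; none of this is a Driverless AI API. With pandas you could get the same numbers from `DataFrame.corr()`.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; report 0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical data: names and values are invented for this example.
data = {
    "max_saving_accts_9m": [1, 2, 2, 3, 4, 4, 5, 6],
    "max_saving_accts_3m": [1, 1, 2, 3, 3, 4, 5, 5],
    "age": [45, 23, 37, 52, 31, 60, 28, 41],
}

for a, b, r in correlated_pairs(data):
    print(f"{a} vs {b}: r = {r:.2f}")
```

This only flags the pairs. Which column of a flagged pair to drop is still a judgment call.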
Removing collinear features is difficult with any modeling approach, since you can't know in advance which feature is better than the other. What if having both "max number of saving accounts in past 9 months" and "max number of saving accounts in past 3 months" makes your model perform much better than having only one? This is where domain knowledge becomes important, and the expert should decide.
One way to reduce collinearity is to limit the number of features your model has. You can use max_orig_cols_selected to cap the number of original columns that are selected. You can set it in expert settings or in config.toml (see the documentation for more info). But as I said before, it's hard to determine whether some collinear features should be kept over others.
Another option is to use algorithms/models that inherently do feature selection, like L1 (LASSO) regression.
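To show why an L1 penalty acts as feature selection, here is a rough sketch of LASSO fitted by coordinate descent in plain Python. It is a toy written for this thread, not what Driverless AI runs internally, and in practice you would use a library implementation. The data, the column names, and alpha=0.5 are assumptions; the target is built from the 9-month column only, so the outcome is easy to follow.

```python
from math import sqrt


def standardize(col):
    """Center a column and scale it to unit (population) variance.
    Assumes the column is not constant."""
    n = len(col)
    mean = sum(col) / n
    std = sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / std for v in col]


def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; anything within [-alpha, alpha] becomes 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(columns, y, alpha, n_sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 on standardized columns."""
    names = list(columns)
    X = {name: standardize(columns[name]) for name in names}
    n = len(y)
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # all weights start at 0
    w = {name: 0.0 for name in names}
    for _ in range(n_sweeps):
        for name in names:
            x = X[name]
            # Covariance of this feature with the residual, with the
            # feature's own current contribution added back in.
            rho = sum(xi * (ri + w[name] * xi) for xi, ri in zip(x, residual)) / n
            new_w = soft_threshold(rho, alpha)  # unit variance, so no rescaling
            delta = new_w - w[name]
            residual = [ri - delta * xi for xi, ri in zip(x, residual)]
            w[name] = new_w
    return w


# Hypothetical data: names and values are invented for this example.
data = {
    "max_saving_accts_9m": [1, 2, 2, 3, 4, 4, 5, 6],
    "max_saving_accts_3m": [1, 1, 2, 3, 3, 4, 5, 5],
    "age": [45, 23, 37, 52, 31, 60, 28, 41],
}
target = [2 * v for v in data["max_saving_accts_9m"]]

weights = lasso_coordinate_descent(data, target, alpha=0.5)
for name, value in weights.items():
    print(f"{name}: {value:.3f}")
```

In this constructed case the 3-month column gets a weight of exactly zero because the 9-month column already explains the target. On real data, which of two correlated features survives depends on the data and on alpha, so the earlier caveat about domain knowledge still applies.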