I have a dataset with a zipcode column. Zip codes have some significance for the output, and I want to use them as a feature. I am using a random forest model.
I need suggestions on the best way to use the zipcode column as a feature. (For example, should I get the lat/long for each zipcode rather than feeding the zipcodes in directly?)
Thanks in advance !!
A common way to handle zip codes, or any high-cardinality categorical column, is "target encoding" (also called "impact encoding"), which replaces each category with a statistic of the target, typically its mean, computed over the rows in that category. In H2O, you can apply target encoding to any categorical column. As of H2O 3.20, this is only available in R, but in the next stable release, 3.22, it will be available in all clients (JIRA ticket here).
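To make the idea concrete, here is a minimal sketch of target encoding in plain pandas. It is not H2O's implementation; the helper name `target_encode`, the `smoothing` parameter, and the toy `zipcode`/`price` data are all made up for illustration. Each zip code is replaced by its mean target value, shrunk toward the global mean so that rare zip codes are not trusted too much.

```python
import pandas as pd


def target_encode(train, column, target, smoothing=10.0):
    """Hypothetical helper: smoothed mean of `target` for each value of `column`."""
    prior = train[target].mean()  # global mean, used as the fallback value
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Weighted average of the category mean and the global mean:
    # the fewer rows a zip code has, the closer it stays to the prior.
    encoded = (stats["count"] * stats["mean"] + smoothing * prior) / (
        stats["count"] + smoothing
    )
    return encoded, prior


# Toy training data (invented for this example).
train = pd.DataFrame({
    "zipcode": ["94105", "94105", "10001", "10001", "10001", "60601"],
    "price": [10.0, 14.0, 20.0, 22.0, 24.0, 30.0],
})

mapping, prior = target_encode(train, "zipcode", "price", smoothing=2.0)
train["zipcode_te"] = train["zipcode"].map(mapping)

# Zip codes not seen during training fall back to the global mean.
new = pd.DataFrame({"zipcode": ["94105", "99999"]})
new["zipcode_te"] = new["zipcode"].map(mapping).fillna(prior)
```

The numeric `zipcode_te` column can then be fed to the random forest in place of the raw zip code. In practice the encoding should be computed only on training folds (for example out-of-fold within cross-validation), otherwise the feature leaks the target into the model.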
If you are using R, my advice is to try both target encoding and the GLRM method mentioned by Lauren, and compare the results. If you're in Python or another language, try GLRM for now and give target encoding a try when H2O 3.22 is released.