scala, apache-spark, machine-learning, logistic-regression, apache-spark-mllib

How to transform a categorical variable into dummy/indicator variables in MLBase


I am trying to use a logistic regression model in MLBase to predict the click-through rate (CTR) of ads. My dataset has some categorical variables, and I want to transform them into dummy/indicator variables to use as input to the model. My data looks like this:

"log_time","country","gender"
"2015-05-19","USA","M"
"2015-05-20","IND","F"

Is there a way to do this transformation in MLBase or Scala?


Solution

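  • What you're looking for is called one-hot encoding: each distinct category value gets its own 0/1 indicator.

    Spark's MLlib has a one-hot encoder, OneHotEncoder in org.apache.spark.ml.feature, which can do this for you. It works on numeric category indices, so string columns such as country and gender are usually passed through StringIndexer first.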
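
As a rough illustration of the answer above, here is a minimal sketch that indexes and one-hot encodes the country and gender columns from the question. It assumes Spark 3.x with the spark-mllib dependency on the classpath; older releases exposed a different OneHotEncoder API, so the setters may not match. The object name OneHotExample, the helper encode, and the output column names (country_idx, country_vec, and so on) are made up for this example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object OneHotExample {

  // Hypothetical helper: adds index and one-hot vector columns
  // for the "country" and "gender" string columns.
  def encode(df: DataFrame): DataFrame = {
    // Step 1: map each distinct string to a numeric index.
    val indexer = new StringIndexer()
      .setInputCols(Array("country", "gender"))
      .setOutputCols(Array("country_idx", "gender_idx"))

    // Step 2: turn each index into a sparse indicator vector.
    val encoder = new OneHotEncoder()
      .setInputCols(Array("country_idx", "gender_idx"))
      .setOutputCols(Array("country_vec", "gender_vec"))

    val pipeline = new Pipeline().setStages(Array(indexer, encoder))
    pipeline.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("one-hot-example")
      .getOrCreate()
    import spark.implicits._

    // The two sample rows from the question.
    val df = Seq(
      ("2015-05-19", "USA", "M"),
      ("2015-05-20", "IND", "F")
    ).toDF("log_time", "country", "gender")

    encode(df).show(truncate = false)
    spark.stop()
  }
}
```

By default OneHotEncoder drops the last category, so a column with k distinct values becomes a vector of length k - 1, which avoids perfectly collinear inputs in a linear model such as logistic regression. Calling setDropLast(false) on the encoder keeps all k indicators. The resulting vector columns can then be combined into a single features column, for example with VectorAssembler, before fitting the model.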