apache-spark machine-learning linear-regression apache-spark-mllib logistic-regression

How to transform the categorical feature

I am kind of new to machine learning, and I am working on a classification/regression problem.

In the dataset, there is a weather feature that takes a few categorical values, such as Sunny, Rainy, Windy, Cloudy, etc.

There are two possible ways to transform this feature:

1. Give each category a numeric index, as in:

date           weather        indexedWeather
2017-11-01      Sunny              0
2017-11-02      Cloudy             1
2017-11-03      Snow               3
2017-11-04      Cloudy             1
2017-11-05      Windy              2
2017-11-06      Sunny              0
2017-11-07      Snow               3
2017-11-08      Cloudy             1

Spark MLlib has a VectorIndexer transformer to do this task.

2. Transform this feature into a binary vector:

date           weather         encodedWeather
2017-11-01      Sunny              1 0 0 0
2017-11-02      Cloudy             0 1 0 0
2017-11-03      Snow               0 0 1 0
2017-11-04      Cloudy             0 1 0 0
2017-11-05      Windy              0 0 0 1
2017-11-06      Sunny              1 0 0 0
2017-11-07      Snow               0 0 1 0
2017-11-08      Cloudy             0 1 0 0

Spark MLlib doesn't provide a transformer for this kind of task.

Which one is preferred? Both options seem to be used in practice. I would lean toward the second one, but I would like to hear your views.

Solution

For the second approach, Spark does provide a transformer: OneHotEncoder. In this case it should be used together with a StringIndexer, which first maps the string categories to numeric indices; see the Spark ML feature documentation for details.

As for which one is more appropriate: since weather is strictly categorical and its values have no natural order, a binary vector is the better choice. This matters most when the algorithm treats its inputs as continuous and uses their numeric values directly (such as logistic regression), because a plain index would imply an order and a distance between categories that do not exist. If there is no clear rank or sortable order that you want the algorithm to consider, use a one-hot encoder.
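To make the ordering argument concrete, here is a small plain-Python sketch (no Spark involved). The helper names are hypothetical, and the indices are assigned in order of first appearance, which is an arbitrary choice made only for this illustration. It compares how far apart two categories look under each encoding.

```python
weather = ["Sunny", "Cloudy", "Snow", "Cloudy", "Windy", "Sunny", "Snow", "Cloudy"]

# Distinct categories in order of first appearance (arbitrary ordering, for illustration).
categories = list(dict.fromkeys(weather))  # ['Sunny', 'Cloudy', 'Snow', 'Windy']
index_of = {label: i for i, label in enumerate(categories)}


def one_hot(label):
    """Return a binary vector with a single 1 at the category's position."""
    vec = [0] * len(categories)
    vec[index_of[label]] = 1
    return vec


def index_distance(a, b):
    """Distance between two categories when each is encoded as a single index."""
    return abs(index_of[a] - index_of[b])


def one_hot_distance(a, b):
    """Squared Euclidean distance between the one-hot vectors of two categories."""
    return sum((x - y) ** 2 for x, y in zip(one_hot(a), one_hot(b)))


# With indices, Sunny looks three times further from Windy than from Cloudy.
print(index_distance("Sunny", "Cloudy"), index_distance("Sunny", "Windy"))  # 1 3

# With one-hot vectors, every pair of distinct categories is equally far apart.
print(one_hot_distance("Sunny", "Cloudy"), one_hot_distance("Sunny", "Windy"))  # 2 2
```

The index encoding invents a ranking that a linear model will try to use, while the one-hot encoding treats all categories symmetrically.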