I have a PySpark dataframe with a categorical column that is being converted into a one-hot encoded vector via...
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

# Map the string labels to numeric indices, then one-hot encode them
si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"]).fit(df)
df = oh.transform(df)
When looking at the dataframe afterward, I see some of the one-hot encoded vectors looking like...
(1,[],[])
I would expect the sparse vectors to look like either (1,[0],[1.0])
or (1,[1],[1.0])
, but here the vectors are just zeros.
Any idea what could be happening here?
This has to do with how the values are encoded in Spark's mllib. The one-hot encoder is not encoding the binary value as...
[1, 0] or [0, 1]
in a [this, that] fashion, but rather as
[1] or [0]
In the sparse vector format, the [0] case looks like (1,[],[])
, meaning: length = 1, no position indexes hold a nonzero value, and (thus) there are no nonzero values to list (you can read more about how mllib represents sparse vectors here). So, just as a binary category needs only a single bit to represent both choices, the one-hot encoding uses a single index in the vector and drops the last category. From another article on encoding...
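To make the sparse format concrete, here is a minimal pure-Python sketch (not the Spark API, just an illustration of the (size, [indices], [values]) tuple) that expands the sparse form into a dense list:

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, [indices], [values]) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# With dropLast=True (the default), a binary label uses a length-1 vector:
assert sparse_to_dense(1, [0], [1.0]) == [1.0]  # one category
assert sparse_to_dense(1, [], []) == [0.0]      # the other (dropped) category
```

So (1,[],[]) is a perfectly valid encoding: it is the all-zero length-1 vector that represents the dropped category.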
One-hot encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), as that is sufficient to encode the one that is not included [... But note that] for classification the recommendation is to use all N columns, as most tree-based algorithms build trees based on all available features.
If you don't want the one-hot encoder to drop the last category to simplify the representation, the mllib class has a dropLast param you can set to False, see https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator
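For illustration, the effect of dropLast can be sketched in plain Python (a hypothetical helper, not the Spark API; with the real encoder you would pass dropLast=False to OneHotEncoderEstimator):

```python
def one_hot(index, num_categories, drop_last=True):
    # With drop_last=True (Spark's default) the vector has N-1 slots,
    # and the last category is represented as all zeros.
    size = num_categories - 1 if drop_last else num_categories
    dense = [0.0] * size
    if index < size:
        dense[index] = 1.0
    return dense

assert one_hot(0, 2) == [1.0]                        # first category
assert one_hot(1, 2) == [0.0]                        # last category: all zeros
assert one_hot(1, 2, drop_last=False) == [0.0, 1.0]  # full N-column encoding
```

With drop_last=False every category gets its own slot, which matches the [this, that] encoding you were expecting.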