Is there a way to perform one-hot encoding (OHE) in Spark and 'flatten' the dataset so that each id has only one row?
For example, if the input is like this:
+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  1|       a|
|  2|       a|
|  0|       c|
+---+--------+
The output should be like this (id 0 has categories a and c, id 1 has a and b, etc.):
+---+----------+----------+----------+
| id|category_a|category_c|category_b|
+---+----------+----------+----------+
|  0|         1|         1|         0|
|  1|         1|         0|         1|
|  2|         1|         1|         0|
+---+----------+----------+----------+
I can do this in pandas with OHE followed by a groupby on id (aggregating with 'max'), but I can't find a way to do it in PySpark because of the specific output format.
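For reference, here is a minimal sketch of the kind of pandas approach meant here. It assumes the sample data above; the names `pdf` and `wide` are only illustrative, and `dtype=int` is passed so the indicator columns come out as 0/1 rather than booleans.

```python
import pandas as pd

# Sample data from the question (long format: one row per id/category pair)
pdf = pd.DataFrame(
    {"id": [0, 1, 2, 1, 2, 0], "category": ["a", "b", "c", "a", "a", "c"]}
)

# One-hot encode the category column, then collapse to one row per id.
# 'max' keeps a 1 if the id had that category in any of its rows.
wide = (
    pd.get_dummies(pdf, columns=["category"], dtype=int)
    .groupby("id", as_index=False)
    .max()
)
print(wide)
```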
Thank you, appreciate any help.
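from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.sql.functions import count

# Index the category strings, then one-hot encode the indices
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])
pipeline = Pipeline(stages=[indexer, encoder])
model = pipeline.fit(df)
transformed_df = model.transform(df)
# One row per id, one column per category; fill missing id/category pairs with 0
result = transformed_df.groupBy("id").pivot("category").agg(count("categoryVec")).na.fill(0)
result.show()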
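This converts the category values to indices with StringIndexer, applies the one-hot encoder, then groups by id, pivots on category, and aggregates the result so that each id ends up on a single row.
I changed the aggregation from max to count, as you asked. Note that count returns the number of rows for each id/category pair, so it can exceed 1 if the same pair appears more than once.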