
Pyspark one-hot encoding with grouping same id


Is there a way to perform one-hot encoding (OHE) in Spark and 'flatten' the dataset so that each id has only one row?

For example if input is like this:

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  1|       a|
|  2|       a|
|  0|       c|
+---+--------+

Output should be like this (id0 has categories a and c, id1 has a and b, etc.):

+---+----------+----------+----------+
| id|category_a|category_c|category_b|
+---+----------+----------+----------+
|  0|         1|         1|         0|
|  1|         1|         0|         1|
|  2|         1|         1|         0|
+---+----------+----------+----------+

I can do this in pandas with one-hot encoding plus a groupby (aggregating with 'max'), but I can't find a way to produce this output format in pyspark.

Thank you, appreciate any help.
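For reference, the pandas approach described above (one-hot encode, then group by id and take the max of each dummy column) can be sketched like this; the column prefix and sample data are taken from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {"id": [0, 1, 2, 1, 2, 0], "category": ["a", "b", "c", "a", "a", "c"]}
)

# get_dummies creates one 0/1 column per category value;
# dtype=int keeps the columns as integers rather than booleans
dummies = pd.get_dummies(df["category"], prefix="category", dtype=int)

# max per id gives 1 if the id ever had that category, else 0
result = pd.concat([df[["id"]], dummies], axis=1).groupby("id").max().reset_index()
print(result)
```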


Solution

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer
    from pyspark.sql.functions import count

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])

    pipeline = Pipeline(stages=[indexer, encoder])
    model = pipeline.fit(df)
    transformed_df = model.transform(df)

    # Pivot cells with no matching rows come back as null, so fill them with 0
    result = (transformed_df.groupBy("id")
              .pivot("category")
              .agg(count("categoryVec"))
              .na.fill(0))

    result.show()

    The StringIndexer converts the category values to numeric indices, the OneHotEncoder encodes them, and the result is then pivoted on id and aggregated.

    Changed from max to count as requested.