Search code examples
pythondataframepyspark

PySpark get_dummies equivalent


I have a pyspark dataframe with the following schema:

Key1 Key2 Key3 Value
a a a "value1"
a a a "value2"
a a b "value1"
b b a "value2"

(In real life this dataframe is extremely large, not reasonable to convert to pandas DF)

My goal is to transform the dataframe to look like so:

Key1 Key2 Key3 value1 value2
a a a 1 1
a a b 1 0
b b a 0 1

I know this is possible in pandas using the get_dummies function and I have also seen that there is some sort of pyspark & pandas hybrid function that I am not sure I can use.

It is worth mentioning that column Value can receive (in this example) only the values "value1" and "value2" I have encountered this question that possibly solves my problem but I do not entirely understand it and was wondering if there was a simpler way to solve the problem.
Any help is greatly appreciated!

SMALL EDIT

After implementing the accepted solution, to turn this into a one-hot encoding and not just a sum of appearances, I converted each column to boolean type and then back to integer.


Solution

  • You can group by on the key columns and pivot the value column while counting all records.

    data_sdf. \
        groupBy('key1', 'key2', 'key3'). \
        pivot('val'). \
        agg(func.count('*')). \
        fillna(0). \
        show()
    
    # +----+----+----+------+------+
    # |key1|key2|key3|value1|value2|
    # +----+----+----+------+------+
    # |   b|   b|   a|     0|     1|
    # |   a|   a|   a|     1|     1|
    # |   a|   a|   b|     1|     0|
    # +----+----+----+------+------+