Search code examples
apache-sparkpysparkbinarypivotmultiple-columns

How to map one column to multiple binary columns in Spark?


This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:

+-----------------+
| C1     | C2     |
+--------|--------+
| A      | xxx    |
| B      | yyy    |
| A      | yyy    |
| B      | www    |
| B      | xxx    |
| A      | zzz    |
| A      | xxx    |
| A      | yyy    |
+-----------------+

to

+--------------------------------------------+
| C1     | www    | xxx    | yyy    | zzz    |
+--------|--------|--------|--------|--------|
| A      |   0    |   1    |   1    |   1    |
| B      |   1    |   1    |   1    |   0    |
+--------------------------------------------+

How does one attain this in PySpark? Presence is 1 and absence is 0.


Solution

  • Yes, you will need pivot. But for aggregation, in your case it's best just to use F.first(F.lit(1)) and when you get nulls, just replace them with 0 using df.fillna(0).

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [('A', 'xxx'),
         ('B', 'yyy'),
         ('A', 'yyy'),
         ('B', 'www'),
         ('B', 'xxx'),
         ('A', 'zzz'),
         ('A', 'xxx'),
         ('A', 'yyy')],
        ['C1', 'C2'])
    
    df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
    df = df.fillna(0)
    
    df.show()
    # +---+---+---+---+---+
    # | C1|www|xxx|yyy|zzz|
    # +---+---+---+---+---+
    # |  B|  1|  1|  1|  0|
    # |  A|  0|  1|  1|  1|
    # +---+---+---+---+---+