Tags: apache-spark, pyspark, contingency

PySpark: reshape data without aggregation


I want to reshape my data in PySpark from 4x3 to a 2x2 contingency layout without aggregating. My current data is the following:

columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12)
]
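
For reference, a minimal way to build this DataFrame (a sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

# Reuse an existing session or create one, then build the DataFrame
# from the lists above; column names come from `columns`.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(vals, columns)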

What I want is a contingency table in which the binary value_HIGH column is pivoted into two new columns (value_HIGH_1, value_HIGH_0) filled with the corresponding values from the count column, i.e.:

columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0']
vals = [
    (1, 21, 141),
    (0, 12, 140)
]

Solution

  • You can use pivot with a dummy max aggregation (since each group contains only one element):

    import pyspark.sql.functions as F

    # F.max acts as a no-op aggregate here: every (FAULTY, value_HIGH)
    # group holds exactly one row, so its "maximum" is simply that count.
    df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
        'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
    ).show()
    +------+------------+------------+
    |FAULTY|value_high_1|value_high_0|
    +------+------------+------------+
    |     0|          12|         140|
    |     1|          21|         141|
    +------+------------+------------+
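
  • Alternatively, a minimal variant (an untested sketch, not from the original answer): passing the pivot values explicitly avoids the extra pass Spark runs to discover the distinct values of value_HIGH, and F.first works just as well as the placeholder aggregate since each group has exactly one row:

    import pyspark.sql.functions as F

    # Explicit pivot values ([1, 0]) skip the distinct-value scan;
    # F.first is another valid "no-op" aggregate for one-row groups.
    (df.groupBy('FAULTY')
       .pivot('value_HIGH', [1, 0])
       .agg(F.first('count'))
       .withColumnRenamed('1', 'value_high_1')
       .withColumnRenamed('0', 'value_high_0')
       .show())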