Tags: python, apache-spark, pyspark, apache-spark-sql, melt

PySpark MapType from column values to array of column name


I've got a dataframe of roles and the ids of people who play those roles. In the table below, the roles are a, b, c, d and the people are a3, 36, 79, 38.

What I want is a map of people to an array of their roles, as shown to the right of the table.

+---+----+----+---+---+--------+
|rec|   a|   b|  c|  d|    ppl |  pplmap
+---+----+----+---+---+--------+-------------------------------------
|  D|  a3|  36| 36| 36|[a3, 36]| [ a3 -> ['a'], 36 -> ['b','c','d'] ]  
|  E|  a3|  79| 79| a3|[a3, 79]| [ a3 -> ['a','d'], 79 -> ['b','c'] ]
|  F|null|null| 38| 38|    [38]| [ 38 -> ['c','d'] ]
+---+----+----+---+---+--------+

And, actually, what I really want is a nicely readable report, like:

D
  a3 roles: a
  36 roles: b, c, d
E
  a3 roles: a, d
  79 roles: b, c
F
  38 roles: c, d

I'm using PySpark 3.

Any suggestions? Thank you!!


Solution

  • You can first unpivot the dataframe, then use a groupBy to construct the map column you want.

    Input dataframe:

    # assumes an active SparkSession available as `spark`
    data = [
        ("D", "a3", "36", "36", "36", ["a3", "36"]),
        ("E", "a3", "79", "79", "a3", ["a3", "79"]),
        ("F", None, None, "38", "38", ["38"]),
    ]
    
    df = spark.createDataFrame(data, ["id", "a", "b", "c", "d", "ppl"])
    

    Use the stack function to unpivot, then map_from_entries after grouping:

    import pyspark.sql.functions as F
    
    df1 = df.selectExpr(
        "id",
        "stack(4, 'a', a, 'b', b, 'c', c, 'd', d) as (role, person)"
    ).filter(
        "person is not null"
    ).groupBy("id", "person").agg(
        F.collect_list("role").alias("roles")
    ).groupBy("id").agg(
        F.map_from_entries(
            F.collect_list(F.struct(F.col("person"), F.col("roles")))
        ).alias("pplmap")
    )
    
    df1.show(truncate=False)
    #+---+----------------------------+
    #|id |pplmap                      |
    #+---+----------------------------+
    #|F  |{38 -> [c, d]}              |
    #|E  |{79 -> [b, c], a3 -> [a, d]}|
    #|D  |{a3 -> [a], 36 -> [b, c, d]}|
    #+---+----------------------------+
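
    To produce the readable report from the question, a minimal sketch (assuming df1 is small enough to collect to the driver; the loop variable names are just illustrative) could be:

    # Collect the grouped result and print one block per record id.
    # A MapType column comes back as a plain Python dict on the driver.
    for row in df1.orderBy("id").collect():
        print(row["id"])
        for person, roles in row["pplmap"].items():
            print(f"  {person} roles: {', '.join(roles)}")

    Note that the order of people within each map is not guaranteed, since collect_list is not deterministic after a shuffle.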
    

    If you want to dynamically generate the stack expression (in case you have many role columns), you can see my other answer here; a rough sketch of the idea follows.
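
    As a sketch (the role_cols list below is an assumption; substitute your actual role columns), the expression can be built from a list of column names:

    # Build the stack() expression dynamically from a list of role columns.
    role_cols = ["a", "b", "c", "d"]  # hypothetical; use your real role columns
    stack_expr = "stack({}, {}) as (role, person)".format(
        len(role_cols),
        ", ".join("'{0}', {0}".format(c) for c in role_cols),
    )
    # stack_expr == "stack(4, 'a', a, 'b', b, 'c', c, 'd', d) as (role, person)"

    df1 = df.selectExpr("id", stack_expr)  # then apply the same filter/groupBy chain as above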