I've got a dataframe of roles and the ids of people who play those roles. In the table below, the roles are a, b, c, d and the people are a3, 36, 79, 38.
What I want is a map of people to an array of their roles, as shown to the right of the table.
+---+----+----+---+---+--------+--------------------------------------+
|rec|   a|   b|  c|  d|     ppl| pplmap                               |
+---+----+----+---+---+--------+--------------------------------------+
|  D|  a3|  36| 36| 36|[a3, 36]| [ a3 -> ['a'], 36 -> ['b','c','d'] ] |
|  E|  a3|  79| 79| a3|[a3, 79]| [ a3 -> ['a','d'], 79 -> ['b','c'] ] |
|  F|null|null| 38| 38|    [38]| [ 38 -> ['c','d'] ]                  |
+---+----+----+---+---+--------+--------------------------------------+
And, actually, what I really want is a nicely readable report, like:
D
  a3 roles: a
  36 roles: b, c, d
E
  a3 roles: a, d
  79 roles: b, c
F
  38 roles: c, d
I'm using PySpark 3.
Any suggestions? Thank you!!
You can first unpivot the dataframe, then use a couple of groupBy aggregations to construct the map column you want.
Input dataframe:
data = [
    ("D", "a3", "36", "36", "36", ["a3", "36"]),
    ("E", "a3", "79", "79", "a3", ["a3", "79"]),
    ("F", None, None, "38", "38", ["38"]),
]
df = spark.createDataFrame(data, ["id", "a", "b", "c", "d", "ppl"])
Using the stack function to unpivot, then map_from_entries after grouping:
import pyspark.sql.functions as F

df1 = df.selectExpr(
    "id",
    "stack(4, 'a', a, 'b', b, 'c', c, 'd', d) as (role, person)"
).filter(
    "person is not null"
).groupBy("id", "person").agg(
    F.collect_list("role").alias("roles")
).groupBy("id").agg(
    F.map_from_entries(
        F.collect_list(F.struct(F.col("person"), F.col("roles")))
    ).alias("pplmap")
)
df1.show(truncate=False)
#+---+----------------------------+
#|id |pplmap |
#+---+----------------------------+
#|F |{38 -> [c, d]} |
#|E |{79 -> [b, c], a3 -> [a, d]}|
#|D |{a3 -> [a], 36 -> [b, c, d]}|
#+---+----------------------------+
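To get the readable report you asked for, one option (a sketch, assuming the result is small enough to collect to the driver) is to collect df1 and format the rows in plain Python. A Spark MapType column collects as a Python dict, so the formatting itself needs no Spark; here it is shown on plain tuples mirroring the collected rows:

```python
# Sketch: format collected (id, pplmap) rows into the readable report.
# Each row mirrors what df1.collect() would yield: a record id plus a
# dict mapping person -> list of roles.
def format_report(rows):
    lines = []
    for rec_id, pplmap in rows:
        lines.append(rec_id)
        for person, roles in pplmap.items():
            lines.append(f"  {person} roles: {', '.join(roles)}")
    return "\n".join(lines)

# With Spark you would build `rows` like:
#   rows = [(r["id"], r["pplmap"]) for r in df1.orderBy("id").collect()]
rows = [
    ("D", {"a3": ["a"], "36": ["b", "c", "d"]}),
    ("E", {"a3": ["a", "d"], "79": ["b", "c"]}),
    ("F", {"38": ["c", "d"]}),
]
print(format_report(rows))
```

Note that map key order inside each row is not guaranteed by Spark, so the people may print in a different order than shown.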
If you want to dynamically generate the stack expression (in case you have many role columns), you can see my other answer here.
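As a quick sketch of that idea: the stack expression is just a SQL string, so you can build it from a list of role column names in plain Python (the `role_cols` list below is an assumption; adapt it to your schema):

```python
# Sketch: build the stack(...) SQL expression dynamically from a column list.
role_cols = ["a", "b", "c", "d"]  # assumed role columns; adjust to your schema

# Each column contributes a ('name', value) pair to the stack expression.
pairs = ", ".join(f"'{c}', {c}" for c in role_cols)
stack_expr = f"stack({len(role_cols)}, {pairs}) as (role, person)"
print(stack_expr)
# Then use it as: df.selectExpr("id", stack_expr)
```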