Tags: python, apache-spark, pyspark, apache-spark-sql, melt

PySpark MapType from column values to array of column name


I've got a dataframe of roles and the ids of people who play those roles. In the table below, the roles are a, b, c, d and the people are a3, 36, 79, 38.

What I want is a map of people to an array of their roles, as shown to the right of the table.

+---+----+----+---+---+--------+
|rec|   a|   b|  c|  d|    ppl |  pplmap
+---+----+----+---+---+--------+-------------------------------------
|  D|  a3|  36| 36| 36|[a3, 36]| [ a3 -> ['a'], 36 -> ['b','c','d'] ]  
|  E|  a3|  79| 79| a3|[a3, 79]| [ a3 -> ['a','d'], 79 -> ['b','c'] ]
|  F|null|null| 38| 38|    [38]| [ 38 -> ['c','d'] ]
+---+----+----+---+---+--------+

And, actually, what I really want is a nicely readable report, like:

D
  a3 roles: a
  36 roles: b, c, d
E
  a3 roles: a, d
  79 roles: b, c
F
  38 roles: c, d

I'm using PySpark 3.

Any suggestions? Thank you!!


Solution

  • You can first unpivot the dataframe, then use a groupBy to construct the map column you want.

    Input dataframe:

    # assumes an active SparkSession available as `spark`
    data = [
        ("D", "a3", "36", "36", "36", ["a3", "36"]),
        ("E", "a3", "79", "79", "a3", ["a3", "79"]),
        ("F", None, None, "38", "38", ["38"]),
    ]
    
    df = spark.createDataFrame(data, ["id", "a", "b", "c", "d", "ppl"])
    

    Use the stack function to unpivot, then map_from_entries after grouping:

    import pyspark.sql.functions as F
    
    df1 = df.selectExpr(
        "id",
        "stack(4, 'a', a, 'b', b, 'c', c, 'd', d) as (role, person)"
    ).filter(
        "person is not null"
    ).groupBy("id", "person").agg(
        F.collect_list("role").alias("roles")
    ).groupBy("id").agg(
        F.map_from_entries(
            F.collect_list(F.struct(F.col("person"), F.col("roles")))
        ).alias("pplmap")
    )
    
    df1.show(truncate=False)
    #+---+----------------------------+
    #|id |pplmap                      |
    #+---+----------------------------+
    #|F  |{38 -> [c, d]}              |
    #|E  |{79 -> [b, c], a3 -> [a, d]}|
    #|D  |{a3 -> [a], 36 -> [b, c, d]}|
    #+---+----------------------------+
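
    To produce the readable report from the question, a minimal sketch (assuming df1 is small enough to collect to the driver; the loop variable names are just illustrative) could be:

    # Collect the grouped result and print one block per record id.
    # A MapType column comes back as a plain Python dict on the driver.
    for row in df1.orderBy("id").collect():
        print(row["id"])
        for person, roles in row["pplmap"].items():
            print(f"  {person} roles: {', '.join(roles)}")

    Note that the order of people within each map is not guaranteed, since collect_list is not deterministic after a shuffle.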
    

    If you want to dynamically generate the stack expression (in case you have many role columns), you can see my other answer here; a rough sketch of the idea follows.
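
    As a sketch (the role_cols list below is an assumption; substitute your actual role columns), the expression can be built from a list of column names:

    # Build the stack() expression dynamically from a list of role columns.
    role_cols = ["a", "b", "c", "d"]  # hypothetical; use your real role columns
    stack_expr = "stack({}, {}) as (role, person)".format(
        len(role_cols),
        ", ".join("'{0}', {0}".format(c) for c in role_cols),
    )
    # stack_expr == "stack(4, 'a', a, 'b', b, 'c', c, 'd', d) as (role, person)"

    df1 = df.selectExpr("id", stack_expr)  # then apply the same filter/groupBy chain as above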