Search code examples
dictionaryapache-sparkpysparkpyspark-schema

Is there a way to store a dictionary as a column value in pyspark?


I have two spark dfs containing different number of columns with the 1st column being the ID (for both). I want to have the column values for each ID as a dictionary (the visual would make better sense of what I am trying to achieve)

what I have:

The result I am trying to get

What I tried but couldn't get the result: => [row.asDict() for row in df.collect()] to get a list of dictionaries but could not find a way to add each as column value.

PS: I understand my question could be XYProblem but, since, I do not know all the concepts available in pyspark/python, this is the best way I could think of to present the issue.


Solution

  • Assuming that each table only has one row per ID, you can create a MapType column for each of the table's fields using create_map, then inner join the two tables. Here is a smaller example:

    table1 = spark.createDataFrame(
        [("1", 34, 45), ("2", 78, 89)], 
        ["ID","col1","col2"]
    )
    
    table2 = spark.createDataFrame(
        [("1", 43, 54), ("2", 11, 12)], 
        ["ID","col1","col2"]
    )
    
    table1 = table1.withColumn("table1_cols", F.create_map(
        F.lit("col1"), F.col("col1"), F.lit("col2"), F.col("col2")
    )
    table2 = table2.withColumn("table2_cols", F.create_map(
        F.lit("col1"), F.col("col1"), F.lit("col2"), F.col("col2")
    )
    
    table1.join(table2, on=["ID"], how='inner').select(
        'ID','table1_cols','table2_cols'
    )
    

    Here is the resulting table and schema:

    +---+------------------------+------------------------+
    |ID |table1_cols             |table2_cols             |
    +---+------------------------+------------------------+
    |1  |{col1 -> 34, col2 -> 45}|{col1 -> 43, col2 -> 54}|
    |2  |{col1 -> 78, col2 -> 89}|{col1 -> 11, col2 -> 12}|
    +---+------------------------+------------------------+
    
    root
     |-- ID: string (nullable = true)
     |-- table1_cols: map (nullable = false)
     |    |-- key: string
     |    |-- value: long (valueContainsNull = true)
     |-- table2_cols: map (nullable = false)
     |    |-- key: string
     |    |-- value: long (valueContainsNull = true)