I have a Spark dataframe spdf
with data that looks as follows:
player_name team_history
John [{Rangers, Center, Active}, {Blackhawks, Center, Former}, {Kings, Center, Former}],
Bob [{Devils, Defense, Active}, {Maple Leafs, Defense, Former}, {Canadiens, Defense, Former}]
The schema is:
hockey_schema = StructType([
StructField("player_name", StringType(), True),
StructField("team_history", ArrayType(
StructType([
StructField("team", StringType(), True),
StructField("position", StringType(), True),
StructField("status", StringType(), True),
])), True)
])
The JSON looks like this:
[{ "player_name" : "John", "team_history" : [ { "team" : "Rangers", "position" : "Center", "status" : "Active" }, { "team" : "Blackhawks", "position" : "Center", "status" : "Former"}, { "team" : "Kings", "position" : "Center", "status" : "Former"} ] },
{ "player_name" : "Bob", "team_history" : [ { "team" : "Devils", "position" : "Defense", "status" : "Active" }, { "team" : "Maple Leafs", "position" : "Defence", "status" : "Former"}, { "team" : "Canadiens", "position" : "Defense", "status" : "Former"} ] }]
I would like to "explode" the contents of the team_history
column to create a new dataframe named df_exploded
with columns for only team
and status
that looks like this:
team status
Rangers Active
Blackhawks Former
Kings Former
Devils Active
Maple Leafs Former
Canadiens Former
How would I create the desired df_exploded
dataframe using the explode()
function in Pyspark?
Thanks!
It seems that using explode
and extracting the values you are interested in from the struct would do the trick:
df\
.select(F.explode("team_history").alias("s"))\
.select("s.team", "s.status")\
.show()
+-----------+------+
|team |status|
+-----------+------+
|Rangers |Active|
|Blackhawks |Former|
|Kings |Former|
|Devils |Active|
|Maple Leafs|Former|
|Canadiens |Former|
+-----------+------+