pyspark, group-by, aggregate

How to join text with group by in PySpark?


I have a PySpark DataFrame:

id events
a0 a-markets-l1
a0 a-markets-watch
a0 a-markets-buy
c7 a-markets-z2
c7 scroll_down
a0 a-markets-sell
b2 next_screen

I am trying to join the events into one string per ID by grouping on id. Here's my Python code (this is the pandas approach, which doesn't work directly on a PySpark DataFrame):

df_events_userpath = df_events.groupby('id').agg({ 'events': lambda x: ' '.join(x)}).reset_index()

The output I want is:

id events
a0 a-markets-l1 a-markets-watch a-markets-buy a-markets-sell
c7 a-markets-z2 scroll_down
b2 next_screen

Solution

  • I have tried using collect_set, which returns an array of the distinct events per id (note that it drops duplicates and does not guarantee order):

    import pyspark.sql.functions as f

    df.groupBy("id").agg(f.collect_set("events").alias("events"))
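
To get the exact space-separated strings shown in the question, one option is to combine collect_list (which keeps duplicates, unlike collect_set) with concat_ws. Here is a minimal sketch, assuming a running SparkSession named spark and the sample data from the question:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question.
df = spark.createDataFrame(
    [("a0", "a-markets-l1"), ("a0", "a-markets-watch"),
     ("a0", "a-markets-buy"), ("c7", "a-markets-z2"),
     ("c7", "scroll_down"), ("a0", "a-markets-sell"),
     ("b2", "next_screen")],
    ["id", "events"],
)

# collect_list gathers all events per id into an array;
# concat_ws joins that array into a single space-separated string.
df_events_userpath = df.groupBy("id").agg(
    f.concat_ws(" ", f.collect_list("events")).alias("events")
)

df_events_userpath.show(truncate=False)

One caveat: collect_list does not guarantee row order after a groupBy, so if the sequence of events matters, keep an ordering column (e.g. a timestamp) in the data and sort on it before collecting.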