Tags: sql, pyspark, split, pandas-explode

In PySpark, how do I get word frequency in a column when a row can contain multiple words?


Assume a two-column PySpark DataFrame with three rows:

Number    Keywords
------    ----------------------
1         Mary had a little lamb
2         A little lamb is white
3         Mary is little

Desired output:

little 3
Mary   2
lamb   2
is     2
a      2
had    1
white  1

I tried "explode" and "split", but could not get the syntax right.
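
For a reproducible setup, the sample DataFrame can be built like this (a minimal sketch; an active SparkSession named `spark` is assumed):

    from pyspark.sql import SparkSession

    # Assumed setup: an active SparkSession named `spark`
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [
            (1, "Mary had a little lamb"),
            (2, "A little lamb is white"),
            (3, "Mary is little"),
        ],
        ["Number", "Keywords"],
    )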


Solution

  • You can try the code below:

    from pyspark.sql import functions as F

    # Split each Keywords string on spaces, then explode so every word
    # becomes its own row
    df = df.withColumn("Keyword", F.explode(F.split(F.col("Keywords"), " ")))

    # Lowercase so "A" and "a" are counted as the same word, then count
    # occurrences and sort by frequency, highest first
    keyword_counts = (
        df.withColumn("Keyword", F.lower(F.col("Keyword")))
        .groupBy("Keyword")
        .count()
        .orderBy(F.col("count").desc())
    )
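
On the sample data, `keyword_counts.show()` should produce something like the table below (the lowercase step merges "Mary"/"mary" and "A"/"a"; ordering among tied counts may vary):

    keyword_counts.show()
    # +-------+-----+
    # |Keyword|count|
    # +-------+-----+
    # | little|    3|
    # |   mary|    2|
    # |   lamb|    2|
    # |     is|    2|
    # |      a|    2|
    # |    had|    1|
    # |  white|    1|
    # +-------+-----+

If the keyword strings can contain runs of whitespace, splitting on the regex pattern `\s+` instead of a single space, e.g. `F.split(F.col("Keywords"), r"\s+")`, is a possible refinement, since `split` accepts a Java regular expression.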