
Pyspark Compare column strings, grouping if alphabetic character sets are same, but avoid similar words?


I'm working on a project where I have a PySpark dataframe with two columns, a word and its count (string and bigint respectively). The dataset is dirty: some words have a non-letter character attached to them (e.g. 'date', '_date', '!date' and 'date,' are all separate rows but should all just be 'date').

print(dirty_df.schema)
# output: StructType([StructField('count', LongType(), True), StructField('word', StringType(), True)])
dirty_df.show()
+------+------+
| count|  word|
+------+------+
| 32375|  date|
|   359| _date|
|   306| !date|
|   213| date,|
|   209|  snap|
|   204| ^snap|
|   107| +snap|
|    12| snap?|
+------+------+

I need to reduce the dataframe so that 'date', '_date', '!date', and 'date,' all become 'date', with their counts summed to match. The catch is that I need to avoid merging similar words like 'dates', 'dating', 'dated', 'todate', etc.

Goal

+------+------+
| count|  word|
+------+------+
| 33253|  date|
|   532|  snap|
+------+------+

Any thoughts on how I could approach this?


Solution

  • Use the regexp_replace function to strip all special characters (the pattern [^a-zA-Z] matches every character that is not a letter, so only the alphabetic part of each word is kept). Since purely alphabetic words such as 'dates' or 'dating' contain no such characters, they are left untouched and will not be merged into 'date'.

    Example:

    from pyspark.sql.functions import col, expr, regexp_replace, sum

    df = spark.createDataFrame([(32375,'date'),(359,'_date'),(306,'[date'),(213,'date]'),(209,'snap'),(204,'_snap'),(107,'[snap'),(12,'snap]')],['count','word'])
    df.withColumn("word",regexp_replace(col("word"),"[^a-zA-Z]","")).groupBy("word").agg(sum(col("count")).alias("count")).show(10,False)
    #+----+-----+
    #|word|count|
    #+----+-----+
    #|date|33253|
    #|snap|532  |
    #+----+-----+
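
    For completeness, here is the same regexp_replace approach run against the exact tokens from the question, plus a hypothetical 'dates' row added only to show that purely alphabetic words are not merged (a minimal sketch, assuming an existing SparkSession named spark as above):

    from pyspark.sql.functions import col, regexp_replace, sum

    # Tokens from the question; 'dates' is added here only to show that
    # words without special characters stay as their own group.
    dirty_df = spark.createDataFrame(
        [(32375, 'date'), (359, '_date'), (306, '!date'), (213, 'date,'),
         (209, 'snap'), (204, '^snap'), (107, '+snap'), (12, 'snap?'),
         (50, 'dates')],
        ['count', 'word'])

    (dirty_df
     .withColumn("word", regexp_replace(col("word"), "[^a-zA-Z]", ""))
     .groupBy("word")
     .agg(sum(col("count")).alias("count"))
     .show())
    # 'date' sums to 33253, 'snap' to 532, and 'dates' stays a separate row.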
    

    Another way:

    If you only want to remove specific characters, use the translate function instead. Note that translate works on individual characters rather than a regex: every character listed in its second argument is deleted (the replacement here is an empty string).

    df.withColumn("word",expr('translate(word,"(_|]|[)","")')).groupBy("word").agg(sum(col("count")).alias("count")).show(10,False)
    
    #+----+-----+
    #|word|count|
    #+----+-----+
    #|date|33253|
    #|snap|532  |
    #+----+-----+
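
    The same transformation can also be written with the DataFrame-API translate function instead of a SQL expression (a sketch under the same assumptions, reusing the df built above):

    from pyspark.sql.functions import col, sum, translate

    # translate() deletes every character listed in its second argument
    # (here '_', '[' and ']') because the replacement string is empty.
    df.withColumn("word", translate(col("word"), "_[]", "")) \
      .groupBy("word") \
      .agg(sum(col("count")).alias("count")) \
      .show(10, False)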