Search code examples
dataframeapache-sparkpysparkmatchingfuzzywuzzy

Create new column with fuzzy-score across two string columns in the same dataframe


I'm trying to calculate a fuzzy score (preferable partial_ratio score) across two columns in the same dataframe.

| column1 | column2|
| -------- | -------------- |
| emmett holt| holt
| greenwald| christopher

It would need to look something like this:

| column1 | column2|partial_ratio|
| -------- | -------------- |-----------|
| emmett holt| holt|100|
| greenwald| christopher|22|
|schaefer|schaefer|100|

With the help of another question on this website, I worked towards the following code:

compare=pd.MultiIndex.from_product([ dataframe['column1'],dataframe ['column2'] ]).to_series()

def metrics (tup):
   return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])

df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)

But the problem already starts with the first line of the code that returns the following error notification:

Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

You can say I'm kind of stuck here so any advice on this is appreciated!


Solution

  • You need a UDF to use fuzzywuzzy:

    from fuzzywuzzy import fuzz
    import pyspark.sql.functions as F
    
    @F.udf
    def fuzzyudf(original_title, title):
        return fuzz.partial_ratio(original_title, title)
    
    df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
    df2.show()
    
    +-----------+-----------+-------------+
    |    column1|    column2|partial_ratio|
    +-----------+-----------+-------------+
    |emmett holt|       holt|          100|
    |  greenwald|christopher|           22|
    +-----------+-----------+-------------+