
How to check term similarity within a pandas column with similarity.jarowinkler


I need to check whether two or more words in a list are similar. To do this, I am using the Jaro-Winkler similarity as follows:

from similarity.jarowinkler import JaroWinkler

word1='sweet chili'
word2='sriracha chilli'

jarowinkler = JaroWinkler()
print(jarowinkler.similarity(word1, word2))

This detects the similarity between two words, but I need to set a threshold so that only words that are at least 80% similar are selected. My difficulty is in checking all the words within a data frame column:

Words

sweet chili
sriracha chilli
tomato
mayonnaise 
water
milk
still water
sparkling water
wine
chicken 
beef
...

What I would like to do is:
- starting with the first element, check the similarity between this one and the others; if the similarity is greater than a threshold (80%), save it in a new array;
- check the second element (sriracha chilli) as above;
- and so on.

Could you please tell me how to write such a loop?


Solution

    • With the given data
    • Using the strsim package (imported as similarity)
    • If the real dataframe has many columns, consider making a dataframe with just the Words column
      • new_df = pd.DataFrame({'Words': df.Words})
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from similarity.jarowinkler import JaroWinkler
    import numpy as np
    
    df = pd.DataFrame({'Words': ['sweet chili', 'sriracha chilli', 'tomato', 'mayonnaise ', 'water', 'milk', 'still water', 'sparkling water', 'wine', 'chicken ', 'beef']})
    
    # instantiate the Jaro-Winkler scorer
    jarowinkler = JaroWinkler()
    
    # strip leading and trailing whitespace
    df.Words = df.Words.str.strip()
    
    # add one column per word, holding its similarity to every entry in Words
    words = df.Words.tolist()
    
    for word in words:
        df[word] = df.Words.apply(lambda x: jarowinkler.similarity(x, word))
    
    |    | Words           |   sweet chili |   sriracha chilli |   tomato |   mayonnaise |    water |     milk |   still water |   sparkling water |     wine |   chicken |     beef |
    |---:|:----------------|--------------:|------------------:|---------:|-------------:|---------:|---------:|--------------:|------------------:|---------:|----------:|---------:|
    |  0 | sweet chili     |      1        |          0.605772 | 0.419192 |     0.39697  | 0.513131 | 0        |      0.515152 |          0.460101 | 0.560606 |  0.322511 | 0.560606 |
    |  1 | sriracha chilli |      0.605772 |          1        | 0.411111 |     0.388889 | 0.344444 | 0.438889 |      0.460101 |          0.488889 | 0.438889 |  0.529365 | 0        |
    |  2 | tomato          |      0.419192 |          0.411111 | 1        |     0.488889 | 0.411111 | 0.472222 |      0.590909 |          0.411111 | 0        |  0        | 0        |
    |  3 | mayonnaise      |      0.39697  |          0.388889 | 0.488889 |     1        | 0.433333 | 0.45     |      0.460606 |          0.544444 | 0.45     |  0.328571 | 0        |
    |  4 | water           |      0.513131 |          0.344444 | 0.411111 |     0.433333 | 1        | 0        |      0.430303 |          0.511111 | 0.633333 |  0.447619 | 0.483333 |
    |  5 | milk            |      0        |          0.438889 | 0.472222 |     0.45     | 0        | 1        |      0.560606 |          0.538889 | 0.5      |  0.595238 | 0        |
    |  6 | still water     |      0.515152 |          0.460101 | 0.590909 |     0.460606 | 0.430303 | 0.560606 |      1        |          0.749854 | 0.44697  |  0.489177 | 0        |
    |  7 | sparkling water |      0.460101 |          0.488889 | 0.411111 |     0.544444 | 0.511111 | 0.538889 |      0.749854 |          1        | 0.544444 |  0.431746 | 0        |
    |  8 | wine            |      0.560606 |          0.438889 | 0        |     0.45     | 0.633333 | 0.5      |      0.44697  |          0.544444 | 1        |  0.595238 | 0.5      |
    |  9 | chicken         |      0.322511 |          0.529365 | 0        |     0.328571 | 0.447619 | 0.595238 |      0.489177 |          0.431746 | 0.595238 |  1        | 0        |
    | 10 | beef            |      0.560606 |          0        | 0        |     0        | 0.483333 | 0        |      0        |          0        | 0.5      |  0        | 1        |
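
For readers wondering where scores like these come from, below is a rough pure-Python sketch of the textbook Jaro-Winkler formula. The helper names `jaro` and `jaro_winkler` are made up for this illustration, and this is not the strsim implementation, whose details may differ, so the values need not match the table exactly.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # characters only count as matching within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions are counted in pairs

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity plus a bonus for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # → 0.9611
print(jaro_winkler("sweet chili", "sriracha chilli"))
```

Because the score rewards shared characters in roughly the same positions, two multi-word strings that share only one similar word tend to score well below 0.8.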
    

    See values greater than 80%

    • None except the exact matches on the diagonal (each word compared with itself)
    df.set_index('Words', inplace=True)
    
    np.where(df[words] > 0.8, df[words], np.nan)
    
    array([[ 1., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
           [nan,  1., nan, nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan,  1., nan, nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan,  1., nan, nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan,  1., nan, nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan,  1., nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan,  1., nan, nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan,  1., nan, nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan,  1., nan, nan],
           [nan, nan, nan, nan, nan, nan, nan, nan, nan,  1., nan],
           [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,  1.]])
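
The loop described in the question (for each word, save the other words that exceed the threshold) could be sketched roughly as follows. `similar_words` and `ratio` are hypothetical helpers written for this illustration; to keep the snippet runnable with the standard library alone, it scores pairs with `difflib.SequenceMatcher`, which is a stand-in and not Jaro-Winkler. The words are assumed to be unique and the score symmetric.

```python
from difflib import SequenceMatcher


def similar_words(words, score, threshold=0.8):
    """Map each word to the other words whose score exceeds the threshold."""
    matches = {word: [] for word in words}
    for i, first in enumerate(words):
        for second in words[i + 1:]:
            # the score is assumed symmetric, so each pair is computed once
            if score(first, second) > threshold:
                matches[first].append(second)
                matches[second].append(first)
    return matches


def ratio(a, b):
    # stand-in scorer from the standard library (not Jaro-Winkler)
    return SequenceMatcher(None, a, b).ratio()


print(similar_words(["chili", "chilli", "water"], ratio))
# → {'chili': ['chilli'], 'chilli': ['chili'], 'water': []}
```

With the strsim setup from the code above, passing `jarowinkler.similarity` as `score` and the `words` list should give the Jaro-Winkler version of the same loop.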
    

    Add a heatmap

    # boolean mask hiding the upper triangle and diagonal, since the matrix is symmetric
    mask = np.zeros_like(df[words], dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    with sns.axes_style("white"):
        f, ax = plt.subplots(figsize=(7, 5))
        ax = sns.heatmap(df[words], mask=mask, square=True, cmap="YlGnBu")
    
