Is there a quantitative descriptor of similarity between two words based on how they sound or are pronounced, analogous to Levenshtein distance?
I know Soundex gives the same ID to similar-sounding words, but as far as I understand it is not a quantitative descriptor of the difference between the words.
from jellyfish import soundex
print(soundex("two"))
print(soundex("to"))
You could combine a phonetic encoding with a string comparison algorithm: encode both words first, then measure the distance between the resulting codes. As a matter of fact, jellyfish supplies both.
Setting up the libraries and example data
from jellyfish import soundex, metaphone, nysiis, match_rating_codex,\
levenshtein_distance, damerau_levenshtein_distance, hamming_distance,\
jaro_similarity
from itertools import groupby
import pandas as pd
import numpy as np
dataList = ['two','too','to','fourth','forth','dessert',
'desert','Byrne','Boern','Smith','Smyth','Catherine','Kathryn']
sounds_encoding_methods = [soundex, metaphone, nysiis, match_rating_codex]
Let's compare the different phonetic encodings
report = pd.DataFrame([dataList]).T
report.columns = ['word']
for i in sounds_encoding_methods:
    print(i.__name__)
    report[i.__name__] = report['word'].apply(lambda x: i(x))
print(report)
           soundex  metaphone  nysiis   match_rating_codex
word
two        T000     TW         TW       TW
too        T000     T          T        T
to         T000     T          T        T
fourth     F630     FR0        FART     FRTH
forth      F630     FR0        FART     FRTH
dessert    D263     TSRT       DASAD    DSRT
desert     D263     TSRT       DASAD    DSRT
Byrne      B650     BRN        BYRN     BYRN
Boern      B650     BRN        BARN     BRN
Smith      S530     SM0        SNAT     SMTH
Smyth      S530     SM0        SNYT     SMYTH
Catherine  C365     K0RN       CATARAN  CTHRN
Kathryn    K365     K0RN       CATRYN   KTHRYN
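You can see that phonetic encoding does a pretty good job of making the words comparable. The encoders behave differently on different cases (for example, Soundex keeps the first letter, so Catherine and Kathryn get different codes, while Metaphone gives both K0RN), so you may prefer one or another depending on your use case.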
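As a rough way to see how much the encoders agree on a given pair, here is a minimal sketch that counts the fraction of encoders giving both words the same code. The codes are copied by hand from the table above rather than recomputed, and `encoder_agreement` is a hypothetical helper name used only for illustration, not part of jellyfish.

```python
# Codes copied from the table above, in the order
# (soundex, metaphone, nysiis, match_rating_codex).
codes = {
    "too": ("T000", "T", "T", "T"),
    "to": ("T000", "T", "T", "T"),
    "two": ("T000", "TW", "TW", "TW"),
    "Smith": ("S530", "SM0", "SNAT", "SMTH"),
    "Smyth": ("S530", "SM0", "SNYT", "SMYTH"),
    "Catherine": ("C365", "K0RN", "CATARAN", "CTHRN"),
    "Kathryn": ("K365", "K0RN", "CATRYN", "KTHRYN"),
}


def encoder_agreement(word_a, word_b):
    """Fraction of encoders that assign both words exactly the same code."""
    matches = [a == b for a, b in zip(codes[word_a], codes[word_b])]
    return sum(matches) / len(matches)


for a, b in [("too", "to"), ("two", "to"), ("Smith", "Smyth"), ("Catherine", "Kathryn")]:
    print(a, b, encoder_agreement(a, b))
```

Exact code equality is a coarse, all-or-nothing signal per encoder, which is why a string distance on the codes gives a finer-grained measure.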
Now I will take the encodings above and find the closest match for each word using levenshtein_distance, but you could use any of the other comparison functions too.
"""Select the closest word by algorithm,
for instance levenshtein_distance"""
report.set_index('word', inplace=True)
report2 = report.copy()
for sounds_encoding in sounds_encoding_methods:
    # reset the column; None keeps it as object dtype so it can hold words
    report2[sounds_encoding.__name__] = None
    for word in dataList:
        closest_list = []
        for word_2 in dataList:
            if word != word_2:
                closest = {}
                closest['word'] = word_2
                # edit distance between the two phonetic codes (lower = closer)
                closest['distance'] = levenshtein_distance(
                    report.loc[word, sounds_encoding.__name__],
                    report.loc[word_2, sounds_encoding.__name__])
                closest_list.append(closest)
        # keep the candidate with the smallest distance
        report2.loc[word, sounds_encoding.__name__] = pd.DataFrame(closest_list).\
            sort_values(by='distance').head(1)['word'].values[0]
print(report2)
           soundex    metaphone  nysiis     match_rating_codex
word
two        too        too        too        too
too        two        to         to         to
to         two        too        too        too
fourth     forth      forth      forth      forth
forth      fourth     fourth     fourth     fourth
dessert    desert     desert     desert     desert
desert     dessert    dessert    dessert    dessert
Byrne      Boern      Boern      Boern      Boern
Boern      Byrne      Byrne      Byrne      Byrne
Smith      Smyth      Smyth      Smyth      Smyth
Smyth      Smith      Smith      Smith      Smith
Catherine  Kathryn    Kathryn    Kathryn    Kathryn
Kathryn    Catherine  Catherine  Catherine  Catherine
As you can see from the above, combining a phonetic encoding with a string comparison algorithm is straightforward, and the distance between the two codes gives you the quantitative measure you asked for.
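If you want a score between 0 and 1 rather than a raw edit count, one option is to scale the distance by the length of the longer code. The sketch below is only an illustration under that assumption: `levenshtein` is a plain-Python stand-in for `jellyfish.levenshtein_distance` so the snippet runs without any dependency, `code_similarity` is a hypothetical helper name, and the codes are copied from the first table.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def code_similarity(code_a, code_b):
    """Scale the edit distance to [0, 1]; 1.0 means the codes are identical."""
    longest = max(len(code_a), len(code_b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(code_a, code_b) / longest


# NYSIIS and Metaphone codes for Catherine / Kathryn, copied from the first table
print(levenshtein("CATARAN", "CATRYN"))
print(code_similarity("CATARAN", "CATRYN"))
print(code_similarity("K0RN", "K0RN"))
```

In practice you would pass the output of whichever jellyfish encoder you prefer into such a function; which normalization suits you best depends on how long the codes typically are.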