Search code examples
pythonaudionlplinguistics

Distance between strings by similarity of sound


Is the a quantitative descriptor of similarity between two words based on how they sound/are pronounced, analogous to Levenshtein distance?

I know soundex gives same id to similar sounding words, but as far as I undestood it is not a quantitative descriptor of difference between the words.

from jellyfish import soundex

print(soundex("two"))
print(soundex("to"))

Solution

  • You could combine phonetic encoding and string comparison algorithm. As a matter of fact jellyfish supplies both.

    Setting up the libraries examples

    from jellyfish import soundex, metaphone, nysiis, match_rating_codex,\
        levenshtein_distance, damerau_levenshtein_distance, hamming_distance,\
        jaro_similarity
    from itertools import groupby
    import pandas as pd
    import numpy as np
    
    
    dataList = ['two','too','to','fourth','forth','dessert',
                'desert','Byrne','Boern','Smith','Smyth','Catherine','Kathryn']
    
    sounds_encoding_methods = [soundex, metaphone, nysiis, match_rating_codex]
    

    Let compare different phonetic encoding

    report = pd.DataFrame([dataList]).T
    report.columns = ['word']
    for i in sounds_encoding_methods:
        print(i.__name__)
        report[i.__name__]= report['word'].apply(lambda x: i(x))
    print(report)
              soundex metaphone   nysiis match_rating_codex
    word                                                   
    two          T000        TW       TW                 TW
    too          T000         T        T                  T
    to           T000         T        T                  T
    fourth       F630       FR0     FART               FRTH
    forth        F630       FR0     FART               FRTH
    dessert      D263      TSRT    DASAD               DSRT
    desert       D263      TSRT    DASAD               DSRT
    Byrne        B650       BRN     BYRN               BYRN
    Boern        B650       BRN     BARN                BRN
    Smith        S530       SM0     SNAT               SMTH
    Smyth        S530       SM0     SNYT              SMYTH
    Catherine    C365      K0RN  CATARAN              CTHRN
    Kathryn      K365      K0RN   CATRYN             KTHRYN
    

    You can see that phonetic encoding is doing a pretty good job making comparable the words. You could see different cases and prefer one or other depending on your case.

    Now I will just take the above and try to find the closest match using levenshtein_distance, but I could you any other too.

    """Select the closer by algorithm
    for instance levenshtein_distance"""
    report2 = pd.DataFrame([dataList]).T
    report2.columns = ['word']
    
    report.set_index('word',inplace=True)
    report2 = report.copy()
    for sounds_encoding in sounds_encoding_methods:
        report2[sounds_encoding.__name__] = np.nan
        matched_words = []
        for word in dataList:
            closest_list = []
            for word_2 in dataList:
                if word != word_2:
                    closest = {}
                    closest['word'] =  word_2
                    closest['similarity'] = levenshtein_distance(report.loc[word,sounds_encoding.__name__],
                                         report.loc[word_2,sounds_encoding.__name__])
                    closest_list.append(closest)
    
            report2.loc[word,sounds_encoding.__name__] = pd.DataFrame(closest_list).\
                sort_values(by = 'similarity').head(1)['word'].values[0]
    
    print(report2)
                 soundex  metaphone     nysiis match_rating_codex
    word                                                         
    two              too        too        too                too
    too              two         to         to                 to
    to               two        too        too                too
    fourth         forth      forth      forth              forth
    forth         fourth     fourth     fourth             fourth
    dessert       desert     desert     desert             desert
    desert       dessert    dessert    dessert            dessert
    Byrne          Boern      Boern      Boern              Boern
    Boern          Byrne      Byrne      Byrne              Byrne
    Smith          Smyth      Smyth      Smyth              Smyth
    Smyth          Smith      Smith      Smith              Smith
    Catherine    Kathryn    Kathryn    Kathryn            Kathryn
    Kathryn    Catherine  Catherine  Catherine          Catherine
    

    As from above you could clearly see that combinations between phonetic encoding and string comparison algorithm can be very straight forward.