python pandas levenshtein-distance edit-distance

How can I compare different rows of one column using the Levenshtein distance metric in pandas?


I have a table like this:

id name
1 gfh
2 bob
3 boby
4 hgf

etc.

I am wondering how I can use the Levenshtein metric to compare the different rows of my 'name' column.

I already know that I can use this to compare two strings:

import Levenshtein as L  # assuming L is the python-Levenshtein package
L.distance('Hello, Word!', 'Hallo, World!')

But how about rows?


Solution

  • Here is a way to do it with pandas and numpy:

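    from io import StringIO

    import pandas as pd
    import Levenshtein as L  # assumed: the python-Levenshtein package, used as L.distance below
    from numpy import triu, ones

    t = """id name
    1 gfh
    2 bob
    3 boby
    4 hgf"""

    # Parse the whitespace-separated sample and index it by id
    df = pd.read_csv(StringIO(t), sep=r'\s+').set_index('id')
    print(df)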
    
            name
    id      
    1    gfh
    2    bob
    3   boby
    4    hgf
    

    Create a DataFrame in which every row repeats the full list of names, with each name wrapped in a one-element list:

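    # One row per id; every row holds the full list of names
    dfs = pd.DataFrame([df.name.tolist()] * df.shape[0], index=df.index, columns=df.index)
    # Wrap each name in a list (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
    dfs = dfs.applymap(lambda x: [x])
    print(dfs)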
    
        id      1      2       3      4
    id                             
    1   [gfh]  [bob]  [boby]  [hgf]
    2   [gfh]  [bob]  [boby]  [hgf]
    3   [gfh]  [bob]  [boby]  [hgf]
    4   [gfh]  [bob]  [boby]  [hgf]
    

    Add the frame to its transpose so that each cell holds a pair of names, then mask the upper triangle (including the diagonal) with NaN so that each pair appears only once:

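    # List concatenation: cell (i, j) becomes [name_j, name_i]
    dfd = dfs + dfs.T
    # Blank out the diagonal and everything above it
    dfd = dfd.mask(triu(ones(dfd.shape)).astype(bool))
    print(dfd)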
    
    id            1            2            3    4
    id                                            
    1           NaN          NaN          NaN  NaN
    2    [gfh, bob]          NaN          NaN  NaN
    3   [gfh, boby]  [bob, boby]          NaN  NaN
    4    [gfh, hgf]   [bob, hgf]  [boby, hgf]  NaN
    

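    Apply L.distance to each remaining pair, leaving the masked NaN cells untouched:

    # NaN cells are floats, not lists, so pass them through unchanged
    dfd.applymap(lambda x: L.distance(x[0], x[1]) if isinstance(x, list) else x)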
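
For comparison, the same pairwise distances can be computed without building the intermediate frames. This is only a minimal sketch, not the method used in the answer above. The helper names `levenshtein` and `pairwise_distances` are hypothetical, and `levenshtein` is a plain-Python stand-in for `L.distance` so that the example runs with the standard library alone.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance where each insert, delete or substitute costs 1."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def pairwise_distances(names_by_id, distance=levenshtein):
    """Map each unordered pair of ids to the distance between their names."""
    return {
        (id_a, id_b): distance(names_by_id[id_a], names_by_id[id_b])
        for id_a, id_b in combinations(names_by_id, 2)
    }


# Same sample data as in the question
names = {1: "gfh", 2: "bob", 3: "boby", 4: "hgf"}
distances = pairwise_distances(names)
for pair, d in distances.items():
    print(pair, d)
```

With a DataFrame, `df['name'].to_dict()` yields a mapping in the shape this sketch expects. If the Levenshtein package is installed, `L.distance` can be passed as the `distance` argument in place of the stand-in.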