How to calculate the normalized Editex similarity between two strings from separate columns


I am trying to calculate the normalized Editex similarity between two strings using Python. So far I have used this code to get the raw Editex distance, which has worked fine:

new_df["EdxScore"] = new_df.apply(lambda x: textdistance.editex(x[0], x[1]), axis=1)

I have read the documentation here: https://anhaidgroup.github.io/py_stringmatching/v0.3.x/Editex.html

However when I try:

new_df["EdxScore"] = new_df.apply(lambda x: textdistance.editex.get_sim_score(x[0],x[1]), axis=1)

I get the error:

AttributeError: ("'Editex' object has no attribute 'get_sim_score'", 'occurred at index 0')

I'm not entirely sure what's going wrong here so any help would be much appreciated!


Solution

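  • Turns out I didn't read the documentation properly. The methods to use are defined in the textdistance documentation, and get_sim_score is not one of them; for a normalized similarity the method to call is .normalized_similarity.

    For clarity I have pasted the relevant part of the documentation below: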

    All algorithms have 2 interfaces:

    Class with algorithm-specific params for customizing.
    Class instance with default params for quick and simple usage.
    

    All algorithms have some common methods:

    .distance(*sequences) – calculate distance between sequences.
    .similarity(*sequences) – calculate similarity for sequences.
    .maximum(*sequences) – maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
    .normalized_distance(*sequences) – normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
    .normalized_similarity(*sequences) – normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
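To make the relationships between these methods concrete, here is a minimal standard-library sketch. It uses plain Levenshtein distance as a stand-in for Editex, and the function names are my own, not textdistance's. It assumes the normalized values are obtained by dividing by the maximum, which is consistent with the descriptions above but is not copied from the library's source.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (stand-in for Editex, for illustration only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def maximum(a: str, b: str) -> int:
    # Largest possible distance: every position of the longer string differs.
    return max(len(a), len(b))


def similarity(a: str, b: str) -> int:
    # Chosen so that distance + similarity == maximum.
    return maximum(a, b) - levenshtein(a, b)


def normalized_distance(a: str, b: str) -> float:
    # Assumption: normalize by the maximum; two empty strings count as equal.
    m = maximum(a, b)
    return levenshtein(a, b) / m if m else 0.0


def normalized_similarity(a: str, b: str) -> float:
    return 1.0 - normalized_distance(a, b)
```

With this layout, the normalized similarity always lands between 0 and 1, and identical strings score exactly 1.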
    

    Most common init arguments:

    qval – q-value for split sequences into q-grams. Possible values:
        1 (default) – compare sequences by chars.
        2 or more – transform sequences to q-grams.
        None – split sequences by words.
    as_set – for token-based algorithms:
        True – t and ttt is equal.
        False (default) – t and ttt is different.
    

    https://pypi.org/project/textdistance/
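
Applied to the original problem, a minimal sketch of the fix might look like this. The DataFrame contents and column names are made up for illustration, and it assumes pandas and textdistance are installed. The only change from the failing line is calling .normalized_similarity instead of get_sim_score.

```python
import pandas as pd
import textdistance

# Hypothetical sample data; the real new_df comes from the question.
new_df = pd.DataFrame(
    {
        "name_a": ["nelson", "martha", "smith"],
        "name_b": ["nelson", "marhta", "schmidt"],
    }
)

# textdistance.editex is the ready-made instance with default parameters.
# normalized_similarity returns a float in [0, 1], where 1 means equal.
new_df["EdxScore"] = new_df.apply(
    lambda row: textdistance.editex.normalized_similarity(row.iloc[0], row.iloc[1]),
    axis=1,
)

print(new_df)
```

I used row.iloc[0] and row.iloc[1] rather than x[0] and x[1] so that the columns are selected by position even when they have string labels.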