Tags: python, string, substring, matching, metrics

Python - Get matched string percentage along with the string


I want to match a string against a list of keywords and get both the match percentage and the substring that matched each keyword. For example, given the keywords

keywords = ['Projekt-Nr.:', 'Projektbezeichnung:', 'Anlagenklassifizierung:', 'Arbeiten / Gewerk:']

and some unknown text, for example

s = "Projekthezeichnung: —_[H- Kloster Eig i Krankenhaus"

I want to search this string for each keyword and get back the part of the string that partially matched it.

'Projektbezeichnung:' should match 'Projekthezeichnung:' with a similarity of over 95%. I am already using cdifflib to compute that score, but it doesn't return the substring that my keyword was matched with.

How can I get the unknown substring that my keyword was partially matched with?

Any help would be quite useful, thanks!


Solution

  • difflib's get_close_matches seems suitable:

    from difflib import get_close_matches as gcm
    
    keywords = ['Projekt-Nr.:', 'Projektbezeichnung:', 'Anlagenklassifizierung:', 'Arbeiten / Gewerk:']
    unk_text = "Projekthezeichnung: —_[H- Kloster Eig i Krankenhaus"
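    # Candidate substrings: the whitespace-separated tokens of the text
    words = unk_text.split()
    
    # For each keyword, keep every token whose similarity ratio is at least 0.8
    result = [gcm(kw, words, n=len(words), cutoff=0.8) for kw in keywords]
    # [[], ['Projekthezeichnung:'], [], []]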
    

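    Each sublist of the result list contains the words from the text that are "close" matches to the corresponding keyword, most similar first. An empty sublist means that no word reached the cutoff.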
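
The question also asks for the percentage, which get_close_matches does not return. As a minimal sketch (not part of the original answer), the same scoring can be done directly with difflib.SequenceMatcher. The helper name best_match is hypothetical. It also assumes that a keyword made of several words, such as 'Arbeiten / Gewerk:', should be compared against windows of the same number of consecutive words in the text.

```python
from difflib import SequenceMatcher


def best_match(keyword, text):
    """Return (substring, ratio) for the run of words in text most similar to keyword.

    Hypothetical helper: the window size is the number of
    whitespace-separated words in the keyword.
    """
    words = text.split()
    n = len(keyword.split())
    best = ("", 0.0)
    # Slide a window of n consecutive words over the text
    for i in range(len(words) - n + 1):
        candidate = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, keyword, candidate).ratio()
        if ratio > best[1]:
            best = (candidate, ratio)
    return best


keywords = ['Projekt-Nr.:', 'Projektbezeichnung:', 'Anlagenklassifizierung:', 'Arbeiten / Gewerk:']
unk_text = "Projekthezeichnung: —_[H- Kloster Eig i Krankenhaus"

for kw in keywords:
    substring, ratio = best_match(kw, unk_text)
    print(f"{kw!r} -> {substring!r} ({ratio:.1%})")
```

For the sample text, 'Projektbezeichnung:' pairs with 'Projekthezeichnung:' at a ratio of 36/38, roughly 94.7%. The other keywords also get a best candidate, just with a low ratio, so a threshold (such as the 0.8 cutoff above) is still needed to decide which pairs count as matches.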