I want to match a string against certain keywords and get both the similarity percentage and the substring that matched each keyword. For example, I have a list of keywords:
keywords = ['Projekt-Nr.:', 'Projektbezeichnung:', 'Anlagenklassifizierung:', 'Arbeiten / Gewerk:']
and some unknown text, e.g.:
s = "Projekthezeichnung: —_[H- Kloster Eig i Krankenhaus"
I want to search for my keywords in this string and get back the substring that partially matched.
'Projektbezeichnung:' should match 'Projekthezeichnung:' with over 95% accuracy. I am already using cdifflib for the scoring, but cdifflib doesn't return the substring my keyword was matched with.
How can I get the unknown substring that my keyword partially matched?
Any help would be quite useful, thanks!
difflib's get_close_matches seems suitable:
from difflib import get_close_matches as gcm
keywords = ['Projekt-Nr.:', 'Projektbezeichnung:', 'Anlagenklassifizierung:', 'Arbeiten / Gewerk:']
unk_text = "Projekthezeichnung: —_[H- Kloster Eig i Krankenhaus"
words = unk_text.split()
result = [gcm(kw, words, n=len(words), cutoff=0.8) for kw in keywords]
# [[], ['Projekthezeichnung:'], [], []]
Each sublist of the result list contains the "close" matches for the corresponding keyword.
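Note that get_close_matches only returns the matched words, not their scores, and because the text is split on whitespace, a multi-word keyword such as 'Arbeiten / Gewerk:' can never match a single word. If you also need the percentage, one possible approach is to score sliding word windows with difflib.SequenceMatcher directly. The following is only a minimal sketch: best_match is a hypothetical helper name, and it assumes the matched text spans the same number of whitespace-separated words as the keyword.

```python
from difflib import SequenceMatcher

def best_match(keyword, text):
    """Return (substring, ratio) for the word window in `text` most similar to `keyword`.

    Assumption: the match spans as many whitespace-separated words as the keyword.
    Returns ("", 0.0) if the text has fewer words than the keyword.
    """
    words = text.split()
    n = len(keyword.split())  # window size in words
    best = ("", 0.0)
    for i in range(len(words) - n + 1):
        candidate = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, keyword, candidate).ratio()
        if ratio > best[1]:
            best = (candidate, ratio)
    return best

keywords = ['Projekt-Nr.:', 'Projektbezeichnung:', 'Anlagenklassifizierung:', 'Arbeiten / Gewerk:']
unk_text = "Projekthezeichnung: —_[H- Kloster Eig i Krankenhaus"

for kw in keywords:
    substring, ratio = best_match(kw, unk_text)
    print(f"{kw!r} -> {substring!r} ({ratio:.0%})")
```

This always returns the best-scoring window for every keyword, even a poor one, so you would still want to apply your own cutoff to the ratio before accepting a match. Since cdifflib provides a drop-in SequenceMatcher, it should be possible to swap it in for speed, but I have not tested that here.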