Tags: python, string, nlp, fuzzy-comparison, stringdist

Accept "close matches" when using strings in a python functions?


I'm trying to use a shortest-path function to find the distance between strings in a graph. The problem is that sometimes there are close matches that I want to count. For example, I would like "communication" to count as "communications", or "networking device" to count as "network device". Is there a way to do this in Python? For example: extract the root of each word, compute a string distance, or use a Python library that already has word-form relationships (plural, gerund, misspelling, etc.). My problem right now is that my process only works when there is an exact match for every item in my database, which is difficult to keep clean.

For example:

List_of_tags_in_graph = ['A', 'list', 'of', 'tags', 'in', 'graph']

given_tag = 'lists'

if min_fuzzy_string_distance_measure(given_tag, List_of_tags_in_graph) < threshold:
    index_of_min = index_of_min_fuzzy_match(given_tag, List_of_tags_in_graph)
    given_tag = List_of_tags_in_graph[index_of_min]

#... then use given_tag in the graph calculation because now I know it matches ...
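The pseudocode above can be sketched with the standard library alone: `difflib.get_close_matches` computes similarity ratios and returns the best candidates above a cutoff, which covers both the "find the closest tag" and the "reject anything too different" steps. The tag list and cutoff value here are illustrative assumptions, not from the original post:

```python
import difflib

list_of_tags_in_graph = ['communication', 'networking device', 'graph', 'tag']

def closest_tag(given_tag, tags, cutoff=0.8):
    """Return the known tag most similar to given_tag,
    or None if nothing meets the similarity cutoff."""
    matches = difflib.get_close_matches(given_tag, tags, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_tag('communications', list_of_tags_in_graph))  # -> 'communication'
print(closest_tag('network device', list_of_tags_in_graph))  # -> 'networking device'
print(closest_tag('zzz', list_of_tags_in_graph))             # -> None
```

Returning `None` on a miss doubles as the error handling the question asks about: the caller can decide whether to skip the tag, log it, or raise.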

Any thoughts on an easy or quick way to do this? Or perhaps a different way to think about accepting close-match strings, or just better error handling when strings don't match?


Solution

  • Try using nltk's WordNetLemmatizer, which is designed to reduce words to their root form (lemma). https://www.nltk.org/_modules/nltk/stem/wordnet.html