Tags: python, python-3.x, difflib

Extract words in a paragraph that are similar to words in list


I have the following string:

"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

List of words to be extracted:

["town","teddy","chicken","boy went"]

NB: town and teddy are wrongly spelt in the given sentence.

I have tried the following but I get other words that are not part of the answer:

import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

list1 = ["town","teddy","chicken","boy went"]

[difflib.get_close_matches(x.lower().strip(), sent.split()) for x in list1 ]

I am getting the following result:

[['twn', 'to'], ['tddy'], ['chicken.', 'picked'], ['went']]

instead of:

'twn', 'tddy', 'chicken','boy went'

Solution

  • Notice in the documentation for difflib.get_close_matches():

    difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)

    Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

    Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

    Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.


    At the moment, you are using the default n and cutoff arguments.
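
    To see why extra words such as 'to' and 'picked' get through the default cutoff, it can help to print the scores directly. get_close_matches() ranks candidates with SequenceMatcher.ratio(); the sketch below calls it by hand. The name `similarity` is a hypothetical helper used only for illustration.

```python
import difflib

def similarity(a, b):
    """Hypothetical helper: difflib's similarity ratio (0.0 to 1.0) for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Word/candidate pairs taken from the output shown in the question
pairs_to_check = [
    ("town", "twn"),
    ("town", "to"),
    ("chicken", "chicken."),
    ("chicken", "picked"),
]

for word, candidate in pairs_to_check:
    print(f"{word!r} vs {candidate!r}: {similarity(word, candidate):.3f}")
```

    'to' and 'picked' score about 0.67 and 0.62, which is above the default cutoff of 0.6, while 'twn' and 'chicken.' score about 0.86 and 0.93.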

    You can specify either (or both), to narrow down the returned matches.

    For example, you could use a cutoff score of 0.75:

    result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]
    

    Or you could specify that at most one match should be returned:

    result = [difflib.get_close_matches(x.lower().strip(), sent.split(), n=1) for x in list1]
    

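    In either case, you could use a list comprehension to flatten the list of lists. difflib.get_close_matches() always returns a list, but that list is empty when no candidate reaches the cutoff, so skip empty results to avoid an IndexError:

    matches = [r[0] for r in result if r]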
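
    A flat list no longer shows which search term produced which match. If that pairing matters, one option is a dictionary keyed by the search term. This is only a sketch: `best_match` is a hypothetical helper, not part of difflib, and the 0.75 cutoff is the illustrative value used above.

```python
import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

def best_match(word, candidates, cutoff=0.75):
    """Hypothetical helper: return the best match, or None if nothing reaches the cutoff."""
    hits = difflib.get_close_matches(word.lower().strip(), candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Map each search term to its best single-word match (or None)
by_term = {x: best_match(x, sent.split()) for x in list1}
print(by_term)
```

    With single words as the only candidates, "boy went" maps to None here, because no single word in the sentence is at least 0.75 similar to it.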
    

    Since you also want close matches for bigrams such as "boy went", you can extract every pair of adjacent "words" from the sentence and pass those pairs to difflib.get_close_matches() as part of the possibilities argument.

    Here is a full working example of this in action:

    import difflib
    import re
    
    sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
    
    list1 = ["town", "teddy", "chicken", "boy went"]
    
    # this extracts overlapping pairings of "words"
    # i.e. ['The boy', 'boy went', 'went to', 'to twn', ...
    pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))', sent)
    
    # search the single words (sent.split()) as before,
    # with the word pairs appended to the same list of possibilities
    result = [difflib.get_close_matches(x.lower().strip(), sent.split() + pairs, n=1) for x in list1]
    
    matches = [r[0] for r in result]
    
    print(matches)
    # ['twn', 'tddy', 'chicken.', 'boy went']
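
    The result above still carries the trailing period in 'chicken.', whereas the question asked for 'chicken'. One possible variation, sketched under the assumption that a "word" is any run of word characters (\w+), is to tokenize with re.findall() and build the pairs with zip(). The variable name `candidates` is introduced here for illustration.

```python
import difflib
import re

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

# Keep only runs of word characters, so "chicken." becomes "chicken"
words = re.findall(r"\w+", sent)

# Adjacent word pairs: "The boy", "boy went", "went to", ...
pairs = [" ".join(pair) for pair in zip(words, words[1:])]

# Search single words and word pairs together, keeping the best hit per term
candidates = words + pairs
result = [difflib.get_close_matches(x.lower().strip(), candidates, n=1) for x in list1]

matches = [r[0] for r in result if r]
print(matches)
# ['twn', 'tddy', 'chicken', 'boy went']
```

    Note that this sketch pairs words across sentence boundaries (for example "chicken He"); split the text into sentences first if that matters for your data.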