I have the following string:
"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
List of words to be extracted:
["town","teddy","chicken","boy went"]
NB: town and teddy are wrongly spelt in the given sentence.
I have tried the following but I get other words that are not part of the answer:
import difflib
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town","teddy","chicken","boy went"]
[difflib.get_close_matches(x.lower().strip(), sent.split()) for x in list1 ]
I am getting the following result:
[['twn', 'to'], ['tddy'], ['chicken.', 'picked'], ['went']]
instead of:
'twn', 'tddy', 'chicken','boy went'
Notice in the documentation for difflib.get_close_matches():
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.
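To see why the default cutoff lets words like 'to' and 'picked' through, it can help to print the scores directly. The sketch below assumes the score is the difflib.SequenceMatcher ratio of each candidate against the word, which is what CPython's difflib uses inside get_close_matches(). The helper name score is made up for illustration.

```python
import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

def score(word, candidate):
    # hypothetical helper: similarity in [0, 1]; the candidate is passed first
    # to mirror the argument order get_close_matches() uses internally
    return difflib.SequenceMatcher(None, candidate, word).ratio()

for word in ["town", "chicken"]:
    # rank every token in the sentence by its similarity to the word
    ranked = sorted(sent.split(), key=lambda tok: score(word, tok), reverse=True)
    for tok in ranked[:3]:
        print(word, tok, round(score(word, tok), 3))
```

For this sentence, 'to' and 'picked' score between 0.6 and 0.75, so they pass the default cutoff but would be dropped by a stricter one.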
At the moment, you are using the default n and cutoff arguments.
You can specify either (or both), to narrow down the returned matches.
For example, you could use a cutoff score of 0.75:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]
Or you could specify that at most one match should be returned:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), n=1) for x in list1]
In either case, you can use a list comprehension to flatten the list of lists (difflib.get_close_matches() always returns a list, which may be empty if nothing meets the cutoff, so r[0] assumes every word found at least one match):
matches = [r[0] for r in result]
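One caveat with r[0]: it raises IndexError for any word whose result list is empty. With cutoff=0.75 and only single words as possibilities, "boy went" has no close enough match. A minimal sketch of a more defensive flatten, where using None as the placeholder is just one possible choice:

```python
import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]

# keep the first match where there is one; use None as a placeholder otherwise
matches = [r[0] if r else None for r in result]
print(matches)
```

Keeping a placeholder preserves the one-to-one alignment between list1 and matches. If that alignment does not matter, the empty results can simply be skipped.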
Since you also want to match bigrams, you can extract pairs of adjacent "words" and pass them to difflib.get_close_matches() as part of the possibilities argument.
Here is a full working example of this in action:
import difflib
import re
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]
# this extracts overlapping pairings of "words"
# i.e. ['The boy', 'boy went', 'went to', 'to twn', ...
pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))', sent)
# we pass the sent.split() list as before
# and concatenate the new pairs list to the end of it also
result = [difflib.get_close_matches(x.lower().strip(), sent.split() + pairs, n=1) for x in list1]
matches = [r[0] for r in result]
print(matches)
# ['twn', 'tddy', 'chicken.', 'boy went']
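Note that the result above still contains 'chicken.' with its trailing full stop, because sent.split() only splits on whitespace. If you want 'chicken' as in the expected output, one option is to tokenise with a regular expression instead. This is a sketch that assumes a "word" is any run of \w characters. The names words and pairs are just illustrative.

```python
import difflib
import re

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

# assumption: a "word" is a run of letters, digits or underscores,
# so the trailing full stop in "chicken." is dropped
words = re.findall(r"\w+", sent)

# adjacent pairs; note a pair can span a sentence break, e.g. 'chicken He'
pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]

matches = []
for x in list1:
    found = difflib.get_close_matches(x.lower().strip(), words + pairs, n=1)
    matches.extend(found)  # adds nothing when no candidate meets the cutoff

print(matches)
# ['twn', 'tddy', 'chicken', 'boy went']
```

Using extend() here also avoids indexing into an empty list when a word has no close match.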