I have a string and a list defined as below:
my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence','she said hello he replied goodbye', 'she replied', 'Some more sentences in here', 'et cetera et cetera...']
I am trying to check whether at least 3 words in my_string exist in any of the strings in my_list. The approach I'm taking is to split my_string and use all to do the matching. However, this only works if all the words in my_string exist in a sentence from my_list:
if all(word in item for item in my_list for word in my_string.split()):
    print('we happy')
1- How can I make the condition succeed if at least 3 words of my_string are present in a sentence from the list?
2- Is it possible to match only the first and last word in my_string, in the same order? I.e. "she" and "replied" are both present in 'she replied' at index 3 of my_list, so return True.
Regarding part 1, I think this should work, and I would recommend using a regex rather than str.split for finding words. You could also use nltk.word_tokenize if your sentences have complex words and punctuation. Both are slower than str.split, but they're useful if you need them.
Here are a couple of decent posts highlighting the differences (wordpunct_tokenize is basically a word regex in disguise):
nltk wordpunct_tokenize vs word_tokenize
Python re.split() vs nltk word_tokenize and sent_tokenize
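To see why the choice of tokenizer matters, here is a small sketch comparing str.split with a \w+ regex on one of the sentences from the question. It uses only the standard library, and the variable names are just for illustration.

```python
import re

sentence = 'et cetera et cetera...'

# str.split only breaks on whitespace, so punctuation stays glued to the word
split_words = sentence.split()

# \w+ keeps runs of letters, digits, and underscores and drops the punctuation
regex_words = re.findall(r'\w+', sentence)

print(split_words)  # ['et', 'cetera', 'et', 'cetera...']
print(regex_words)  # ['et', 'cetera', 'et', 'cetera']
```

With str.split, 'cetera...' and 'cetera' count as different words, which would throw off any word-level comparison.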
import re
num_matches = 3
def get_words(text):
    # \w+ matches runs of letters, digits, and underscores (raw string avoids escape warnings)
    return re.findall(r'\w+', text)
my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence','she said hello he replied goodbye', 'she replied', 'Some more sentences in here', 'et cetera et cetera...']
my_string_word_set = set(get_words(my_string))
my_list_words_set = [set(get_words(x)) for x in my_list]
result = [len(my_string_word_set.intersection(x)) >= num_matches for x in my_list_words_set]
print(result)
Results in
[False, False, True, False, False, False]
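If a single yes/no answer is wanted, as in the if statement from the question, the per-sentence check can be wrapped in any(). This is a minimal sketch; has_min_matches and its min_matches parameter are hypothetical names, not part of the code above. Note that it counts distinct words and is case-sensitive.

```python
import re

def get_words(text):
    # Same word regex as in the answer
    return re.findall(r'\w+', text)

def has_min_matches(query, sentences, min_matches=3):
    """Return True if any sentence shares at least min_matches distinct words with query."""
    query_words = set(get_words(query))
    return any(
        len(query_words & set(get_words(sentence))) >= min_matches
        for sentence in sentences
    )

my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence',
           'she said hello he replied goodbye', 'she replied',
           'Some more sentences in here', 'et cetera et cetera...']

if has_min_matches(my_string, my_list):
    print('we happy')
```

Because any() short-circuits, it stops at the first sentence that reaches the threshold.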
For part 2, something like this should work, though it's not a super clean solution. If you want the two words to be next to each other rather than merely in order, check that their indexes are exactly 1 apart instead.
words = get_words(my_string)
first_and_last = [words[0], words[-1]]
my_list_dicts = []
for sentence in my_list:
    word_dict = {}
    sentence_words = get_words(sentence)
    for i, word in enumerate(sentence_words):
        word_dict[word] = i
    my_list_dicts.append(word_dict)
result2 = []
for word_dict in my_list_dicts:
    if all(k in word_dict for k in first_and_last) and word_dict[first_and_last[0]] < word_dict[first_and_last[1]]:
        result2.append(True)
    else:
        result2.append(False)
print(result2)
Result:
[False, False, True, True, False, False]
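One caveat with the dictionary approach is that it keeps only the last position of each word, so a sentence that repeats a word (for example 'she replied she') can be judged by the wrong occurrence. The sketch below is one possible alternative, not the method above: it compares the earliest occurrence of the first word with the latest occurrence of the last word, and also covers the "next to each other" variant. The function first_then_last and its adjacent parameter are hypothetical names.

```python
import re

def get_words(text):
    return re.findall(r'\w+', text)

def first_then_last(query, sentence, adjacent=False):
    """Check that the first word of query occurs before its last word in sentence."""
    query_words = get_words(query)
    sentence_words = get_words(sentence)
    if not query_words:
        return False
    first, last = query_words[0], query_words[-1]
    if adjacent:
        # Look for the first word immediately followed by the last word
        return any(
            a == first and b == last
            for a, b in zip(sentence_words, sentence_words[1:])
        )
    if first not in sentence_words or last not in sentence_words:
        return False
    earliest_first = sentence_words.index(first)
    latest_last = len(sentence_words) - 1 - sentence_words[::-1].index(last)
    return earliest_first < latest_last

my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence',
           'she said hello he replied goodbye', 'she replied',
           'Some more sentences in here', 'et cetera et cetera...']

print([first_then_last(my_string, s) for s in my_list])
# [False, False, True, True, False, False]
print([first_then_last(my_string, s, adjacent=True) for s in my_list])
# [False, False, False, True, False, False]
```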