Tags: python, list, nlp, tokenize

Search a python list for matches to a custom list of stem words of varying length


I'm trying to search word-tokenized abstracts for custom stem words using Python. The following code is almost what I want. That is, do any of the values in stem_words appear one or more times in word_tokenized_abstract?

if(any(word in stem_words for word in word_tokenized_abstract)):
    do stuff

where...

  • stem_words is a list of strings only
  • word_tokenized_abstract is a list of strings only
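
For concreteness, a minimal, self-contained version of the check I have now looks like this (the tokenized abstract is just made-up sample data):

stem_words = ['pancrea', 'muscul', 'derma', 'ovar']
# made-up tokenized abstract, for illustration only
word_tokenized_abstract = ['the', 'ovarian', 'tissue', 'was', 'examined']

# True only when a token is an exact match for one of the stems;
# 'ovarian' is not an exact match for 'ovar', so nothing is printed here
if any(word in stem_words for word in word_tokenized_abstract):
    print('exact match found')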

I based that check on the question "one-liner to check if at least one item in list exists in another list?".

My issue is that my stem_words are of different lengths. I've tried the following code (a modification of the above), which did not work for me. I've tried a few other modifications, but they either don't work or cause a crash.

if(any(word in stem_words for word[0:len(word)] in word_tokenized_abstract)):
    do stuff

That is, do any of the values in word_tokenized_abstract begin with any of the values in stem_words?

If it helps, my stem_words = ['pancrea', 'muscul', 'derma', 'ovar']

Thanks! I apologize if this question has been answered previously but I couldn't find it.


Solution

  • So you want to check whether any string in the first list is contained in any of the strings of the second list.

    I'd try this:

    any(y.startswith(x) for y in word_tokenized_abstract for x in stem_words)
    

    Explanation: for each word y in word_tokenized_abstract, check whether it starts with any of the stems x in stem_words.
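
    For example, with the stems from your question and a made-up token list (sample data, not from your question):

    stem_words = ['pancrea', 'muscul', 'derma', 'ovar']
    # made-up tokens for illustration
    word_tokenized_abstract = ['the', 'ovarian', 'tissue', 'was', 'examined']

    # 'ovarian' starts with the stem 'ovar', so this is True
    if any(y.startswith(x) for y in word_tokenized_abstract for x in stem_words):
        print('stem match found')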

    If you just want the stem to be a substring of the word then use:

    any(x in y for y in word_tokenized_abstract for x in stem_words)
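
    The substring check also matches a stem in the middle of a token. For example, with your stems and another made-up token list, 'intramuscular' contains 'muscul' even though it does not start with it:

    stem_words = ['pancrea', 'muscul', 'derma', 'ovar']
    # made-up tokens for illustration
    word_tokenized_abstract = ['an', 'intramuscular', 'injection']

    # the startswith() version would be False here, but the substring check is True
    any(x in y for y in word_tokenized_abstract for x in stem_words)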