
python spacy looking for two (or more) words in a window


I am trying to identify concepts in texts. I often consider a concept to appear in a text when two or more words occur relatively close to each other. For instance, a concept would be any of the words forest, tree, nature within a distance of fewer than 4 words from fire, burn, overheat.

I am learning spacy and so far I can use the matcher like this:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],[{"LOWER": "hello"}, {"LOWER": "world"}])

That would match "hello world" and "hello, world" (or, with the corresponding words, "tree firing" in the example above).

I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.

I had a look at https://spacy.io/usage/rule-based-matching

and the operators described there, but I am not able to express this word-window approach in spaCy syntax.

Furthermore, I am not able to generalize it to more words.

Any ideas? Thanks


Solution

  • For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your words. A wildcard matches any token, and in spaCy it is just an empty dict. Optional means the token may or may not be present, and in spaCy it is encoded as {"OP": "?"}.

    Thus, you can write your matcher as

    import spacy
    from spacy.matcher import Matcher
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # spaCy 2.x signature; in spaCy 3.x the call is matcher.add("HelloWorld", [pattern])
    matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"OP": "?"},  {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}])
    

    which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for

    doc = nlp(u"Hello brave new world")
    for match_id, start, end in matcher(doc):
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)
    

    it will print

    15578876784678163569 HelloWorld 0 4 Hello brave new world
    

    If you also want to match the other order (world ? ? ? hello), add a second, symmetric pattern to your matcher.
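
To generalize this to sets of words (as in the forest/fire example from the question) and to both orders, the patterns can be generated instead of written by hand. The following is only a sketch: `window_patterns` is a hypothetical helper, not part of spaCy, and it assumes a spaCy version that supports the `IN` set operator in token patterns (2.1 or later, as far as I know).

```python
def window_patterns(first_words, second_words, window):
    """Build Matcher patterns: a word from one set, up to `window - 2`
    optional wildcard tokens, then a word from the other set.

    Hypothetical helper for illustration; it only builds plain dicts/lists.
    """
    if window < 2:
        raise ValueError("window must cover at least the two matched words")
    # One optional wildcard token per free slot between the two words.
    gap = [{"OP": "?"} for _ in range(window - 2)]
    first = {"LOWER": {"IN": sorted(first_words)}}
    second = {"LOWER": {"IN": sorted(second_words)}}
    # One pattern per order, so "fire ... forest" is covered as well as "forest ... fire".
    return [[first, *gap, second], [second, *gap, first]]


patterns = window_patterns(
    {"forest", "tree", "nature"},
    {"fire", "burn", "overheat"},
    window=5,
)
```

The resulting list can then be registered with `matcher.add("NatureFire", None, *patterns)` on spaCy 2.x or `matcher.add("NatureFire", patterns)` on spaCy 3.x (the name "NatureFire" is just an example). Note that `LOWER` compares surface forms, so "burning" would not match "burn"; using `LEMMA` instead is an option if your pipeline provides lemmas.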