
spaCy: how do I make a Matcher pattern for noun-noun with no white space inside it?


I tried to make a matcher which could detect words like

'all-purpose'

I was trying to make a pattern like

pattern=[{'POS':'NOUN'}, {'ORTH':'-'},{'POS':'NOUN'}]

However, I realized that it only finds matches like

'all - purpose', with white space between the tokens, instead of 'all-purpose'.

How could I make a matcher like this? It has to be a generalized pattern like noun-noun rather than specific words like 'Barack Obama', as in the example in the spaCy documentation.

Best,


Solution

  • What exactly are you trying to match? Using en_core_web_sm, "all-purpose" is three tokens, and "all" gets the ADV POS tag for me, so that might be the issue with your match pattern. If you just want hyphenated words, this might be a better match:

    pattern = [{'IS_ALPHA': True}, {'ORTH':'-'}, {'IS_ALPHA': True}]
    

    More generally, you are correct that your pattern will only match three tokens, but that doesn't require white space; it depends on how the tokenizer works. For example, "that's" has no spaces but is two tokens.

    If you are finding hyphenated words that occur as one token and want to match them, you can use regular expressions in Matcher rules. Here's an example of how that would work, from the docs:

    pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]
    

    In your case it could just look like this, which matches any single token whose text contains a hyphen:

    pattern = [{"TEXT": {"REGEX": "-"}}]
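
To make the three-token case concrete, here is a minimal sketch, assuming spaCy v3 (where `matcher.add` takes a list of patterns) and a blank English pipeline, so no model download is needed. The helper name `hyphenated_spans` and the rule name "HYPHENATED" are made up for illustration. Because the three-token pattern also matches 'all - purpose', the sketch filters on `Token.whitespace_` to keep only matches with no space inside them.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, so no POS tags are available here.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# Three tokens: alphabetic word, literal hyphen, alphabetic word.
pattern = [{"IS_ALPHA": True}, {"ORTH": "-"}, {"IS_ALPHA": True}]
matcher.add("HYPHENATED", [pattern])  # spaCy v3 signature (assumption)


def hyphenated_spans(text):
    """Return matched hyphenated words that contain no internal whitespace."""
    doc = nlp(text)
    spans = []
    for _match_id, start, end in matcher(doc):
        span = doc[start:end]
        # Token.whitespace_ is the whitespace that follows a token; require
        # that neither the first word nor the hyphen is followed by a space.
        if not any(token.whitespace_ for token in span[:-1]):
            spans.append(span.text)
    return spans


# Show how the tokenizer splits the hyphenated word.
print([token.text for token in nlp("an all-purpose cleaner")])
print(hyphenated_spans("an all-purpose cleaner and an all - purpose one"))
```

To restrict the match to nouns, you would load a trained pipeline such as en_core_web_sm and swap `IS_ALPHA` for `POS`, keeping in mind that the tagger may not label both halves as NOUN (as with "all" above).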