python regex nlp nltk spacy

spaCy Matcher pattern: IN + REGEX tag


My goal is to use spaCy to match the sentences that contain one of the following words: ['studium', 'abschluss', 'ausbildung']

I can solve the problem with this line:

pattern = [{"LOWER": {"IN": ["studium", "abschluss", "ausbildung"]}}]

My problem is that German makes extensive use of compound words such as Hochschulstudium, Masterstudium, Studiengang, etc.
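To make the limitation concrete, here is a minimal sketch of the IN pattern in use. It assumes spaCy v3 and a blank German pipeline with a rule-based sentencizer; the names `matcher` and `matching_sentences`, the label "EDUCATION", and the sample text are illustrative and not part of the original post.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3. A blank German pipeline only tokenizes,
# so no trained model has to be downloaded.
nlp = spacy.blank("de")
# Rule-based sentence splitter, needed so that Span.sent is available.
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": {"IN": ["studium", "abschluss", "ausbildung"]}}]
matcher.add("EDUCATION", [pattern])  # "EDUCATION" is an arbitrary label


def matching_sentences(text):
    """Return the sentences containing at least one matched token, in order."""
    doc = nlp(text)
    found = []
    for _match_id, start, end in matcher(doc):
        sentence = doc[start:end].sent.text
        if sentence not in found:
            found.append(sentence)
    return found


# Illustrative text: the second sentence contains a compound word.
text = "Ich habe eine Ausbildung gemacht. Danach kam das Masterstudium. Heute arbeite ich."
print(matching_sentences(text))
```

Because IN compares the whole lowercased token against the list, the sentence with Ausbildung is returned while the one with Masterstudium is not.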

How can I use a regex inside the IN list to match all words containing the word Studium?


Solution

  • You can use the REGEX operator:

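    import re
    # Words to match exactly (compared against the lowercased token text)
    l = ['abschluss', 'ausbildung']
    # One anchored regex: any item of l, or any run of letters ending in "studium"
    pattern = [{'LOWER': {'REGEX': fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'}}]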
    

    Note:

    • map(re.escape, l) - escapes the items in the l list
    • "|".join(...) - joins the words as alternatives (word1|word2|wordN)
    • ^(?:...|[^\W\d_]*studium)$ - a regex that matches:
      • ^ - start of string (here, the token)
      • (?:...|[^\W\d_]*studium) - a non-capturing group matching either one of the l items or zero or more letters ([^\W\d_]*) followed by studium
      • $ - end of string (here, the token).
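The regex can be checked on its own with plain `re`, without spaCy. The sketch below rebuilds the same expression and applies it to a few hand-picked tokens; the token list and the names `regex` and `matched` are illustrative. It assumes, to the best of my understanding, that spaCy applies REGEX as a search on the attribute value, which is why the `^` and `$` anchors matter. Lowercasing with `.lower()` stands in for the LOWER attribute.

```python
import re

l = ["abschluss", "ausbildung"]
# Same expression as in the answer: exact words from l, or letters + "studium"
regex = fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'
print(regex)

# Illustrative tokens, not taken from the original post
tokens = ["Studium", "Masterstudium", "Hochschulstudium", "Abschluss",
          "Studiengang", "Studiums", "3studium"]

# .lower() mimics the LOWER attribute; re.search mimics an unanchored search,
# so only the ^ and $ in the pattern force a whole-token match.
matched = [t for t in tokens if re.search(regex, t.lower())]
print(matched)
```

Note that Studiengang and the inflected form Studiums are not matched, since the expression requires the token to end in studium; 3studium is rejected because `[^\W\d_]` excludes digits. In spaCy v3 the resulting `pattern` would be registered with something like `matcher.add("EDUCATION", [pattern])`, where the label is arbitrary.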