My goal is to use spaCy to match sentences that contain one of the following words: ['studium','abschluss','ausbildung']
I can solve the problem with this line:
pattern = [{"LOWER": {'IN':['studium','abschluss', 'ausbildung']}}]
My problem is that German makes heavy use of compound words like Hochschulstudium, Masterstudium, Studiengang etc.
How can I use a regex inside the IN list to match all words containing the word Studium?
You can use the REGEX operator:
import re
l = ['abschluss', 'ausbildung']
pattern = [{'LOWER': {'REGEX':fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'}}]
Note:
- map(re.escape, l) - escapes the items in the l list
- "|".join(...) - joins the words as alternatives (word1|word2|wordN)
- ^(?:...|[^\W\d_]*studium)$ - a regex that matches
  - ^ - start of string (here, the token)
  - (?:...|[^\W\d_]*studium) - a non-capturing group matching any of the l items, or zero or more letters ([^\W\d_]*) followed by studium
  - $ - end of string (here, the token).
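To see the pattern in action, here is a minimal sketch that plugs it into a spaCy Matcher. It assumes spaCy v3 (where Matcher.add takes a list of patterns) and uses a blank German pipeline so no trained model is required; the label "EDUCATION", the helper matched_tokens and the sample sentences are made up for illustration.

```python
import re

import spacy
from spacy.matcher import Matcher

# Assumption: a blank German pipeline (tokenizer only) is enough here,
# because the pattern only looks at the lowercased token text.
nlp = spacy.blank("de")

words = ["abschluss", "ausbildung"]
alternatives = "|".join(map(re.escape, words))

# Same regex as in the answer: an exact word from the list,
# or any run of letters that ends in "studium".
pattern = [{"LOWER": {"REGEX": rf"^(?:{alternatives}|[^\W\d_]*studium)$"}}]

matcher = Matcher(nlp.vocab)
# "EDUCATION" is an arbitrary label chosen for this example.
matcher.add("EDUCATION", [pattern])


def matched_tokens(text):
    """Return the text of every token in `text` that the pattern matches."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(matched_tokens("Ich habe ein Masterstudium und eine Ausbildung."))
print(matched_tokens("Der Studiengang ist neu."))
```

As written, the regex requires the token to end in studium, so Masterstudium and Hochschulstudium match but Studiengang does not; covering that form would presumably need a further alternative in the group, for example one built around the stem studien.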