I am trying to extract data from within parentheses using a spaCy Matcher.
Say the text is: 'I am on StackOverflow(for x years) and I ask (technical) questions here about Natural Language Processing (NLP) (information retrieval)'
The desired output of the matcher is: (for x years), (technical), (NLP), (information retrieval)
Below is the code I tried:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
text = 'I am on StackOverflow(for x years) and I ask (technical) questions here about Natural Language Processing (NLP) (information retrieval)'
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": '('}, {"TEXT": {"REGEX": r".*?"}}, {"ORTH": ')'}]
matcher.add('paranthesis_data', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end])
The output I am getting is as below:
but I would like the output to be: data (for x years) data (technical) data (NLP) data (information retrieval)
I know I could use regex, but that's not an option in my project. If I use 'OP', it returns very long matches, something like: (for x years) and I ask (technical)
Any help is very much appreciated.
In Matcher patterns, REGEX matches against a single token, not the text of the whole Doc, so it isn't doing what you want here.
I think you can get what you want with a pattern like this:
pattern = [{"TEXT": "("}, {"TEXT": {"NOT_IN": [")"]}, "OP": "*"}, {"TEXT": ")"}]
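Here's a minimal runnable sketch of that pattern. It uses spacy.blank("en") instead of en_core_web_sm, since the pattern only looks at token text and no statistical model is needed, and it assumes a space after "StackOverflow" so the default tokenizer splits the opening parenthesis off (see the tokenizer note below):

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here: the pattern only matches on token text.
nlp = spacy.blank("en")

# Space added after "StackOverflow" so "(" becomes its own token.
text = ("I am on StackOverflow (for x years) and I ask (technical) questions "
        "here about Natural Language Processing (NLP) (information retrieval)")
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# "(", then any number of tokens that are not ")", then ")"
pattern = [{"TEXT": "("}, {"TEXT": {"NOT_IN": [")"]}, "OP": "*"}, {"TEXT": ")"}]
matcher.add("paranthesis_data", [pattern])  # spaCy v3 signature

for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end])
# paranthesis_data (for x years)
# paranthesis_data (technical)
# paranthesis_data (NLP)
# paranthesis_data (information retrieval)
```

Each parenthetical yields exactly one match, because the "*" tokens are forbidden from being ")", so every match must stop at the first closing parenthesis.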
A couple of other issues...
The string StackOverflow(for (note the lack of a space) is probably going to be a single token, so the pattern above won't match it. You'll need to adjust the tokenizer to deal with that if it's a common problem in your data.
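If missing spaces like that are common, one way to handle it is to add "(" to the tokenizer's infix rules so it also splits inside a token. A sketch, assuming a blank English pipeline:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Extend the default infix patterns so "(" also splits *inside* a token
# like "StackOverflow(for".
infixes = list(nlp.Defaults.infixes) + [r"\("]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("I am on StackOverflow(for x years)")
print([t.text for t in doc])
# ['I', 'am', 'on', 'StackOverflow', '(', 'for', 'x', 'years', ')']
```

With the parenthesis split into its own token, the Matcher pattern above works on the unspaced text too.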
You seem to be using v2-style Matcher code (the None callback argument in matcher.add). spaCy v3 has been out for a year, and I would recommend upgrading if you're starting a new project; in v3 the call becomes matcher.add('paranthesis_data', [pattern]).
Also see the rule-based matching docs.