Tags: python-3.x, nlp, spacy, text-extraction

Extract all the data within parentheses using spaCy Matcher


I am trying to extract data from within parentheses using a spaCy Matcher.

Say the text is: 'I am on StackOverflow(for x years) and I ask (technical) questions here about Natural Language Processing (NLP) (information retrieval)'

The desired output of the matcher is: (for x years) (technical) (NLP) (information retrieval)

Below is the code I tried to work with:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
text = 'I am on StackOverflow(for x years) and I ask (technical) questions here about Natural Language Processing (NLP) (information retrieval)'
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# "(" followed by a single token matching the regex, then ")"
pattern = [{"ORTH": '('}, {"TEXT": {"REGEX": r".*?"}}, {"ORTH": ')'}]
matcher.add('paranthesis_data', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end])

The output I am getting is as below:

[screenshot of the actual matcher output]

but I would like the output to be like: data (for x years) data (technical) data (NLP) data (information retrieval)

I know I could use regex, but that's not an option in my project. If I use 'OP' it returns a very long match, something like: (for x years) and I ask (technical)

Any help is very much appreciated.


Solution

  • In Matcher patterns, REGEX matches a single token, not the text of the whole Doc. It isn't doing what you want.

    I think you can get what you want with a pattern like this:

    pattern = [{"TEXT": '(', }, {"TEXT": {"NOT_IN": [")"]}, "OP": "*"}, {"TEXT": ')'}]
    

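    For reference, a minimal end-to-end sketch might look like the following (using the spaCy v3 Matcher.add signature; the rule name "paren_data" is just a placeholder):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # "(" followed by any run of tokens that are not ")", then ")"
    pattern = [{"TEXT": "("}, {"TEXT": {"NOT_IN": [")"]}, "OP": "*"}, {"TEXT": ")"}]
    matcher.add("paren_data", [pattern])

    doc = nlp('I am on StackOverflow(for x years) and I ask (technical) questions here about Natural Language Processing (NLP) (information retrieval)')
    for match_id, start, end in matcher(doc):
        print(nlp.vocab.strings[match_id], doc[start:end])

    # This should print spans for (technical), (NLP) and (information retrieval);
    # "(for x years)" is probably missed because "StackOverflow(for" is one token (see below).
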
    A couple of other issues...

    The string StackOverflow(for (note the lack of a space) is probably going to be a single token. You'll need to adjust the tokenizer to deal with that if it's a common problem; one way is sketched below.
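
    As a rough sketch (assuming spaCy v3 and the default English tokenizer; adding "(" to the infix rules is just one option), you could force that split like this:

    from spacy.util import compile_infix_regex

    # Also split tokens on an internal "(" so that "StackOverflow(for"
    # becomes "StackOverflow", "(", "for".
    infixes = list(nlp.Defaults.infixes) + [r"\("]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer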

    You seem to be using v2-style Matcher code. spaCy v3 has been out for a year, so I would recommend upgrading if you're starting a new project.
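
    For example, the add call from the question would change like this:

    # spaCy v2
    matcher.add('paranthesis_data', None, pattern)
    # spaCy v3: patterns are passed as a list; callbacks go in on_match=
    matcher.add('paranthesis_data', [pattern])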

    Also see the rule-based matching docs.