python nlp tokenize parse-tree benepar

How can I prevent the benepar parser from splitting a specific substring when parsing a string?


I use the benepar parser to parse sentences into trees. How can I prevent the benepar parser from splitting a specific substring when parsing a string?

For example, the token "gonna" is split into two tokens, "gon" and "na", which I don't want.


Code example, with prerequisites:

pip install spacy benepar
python -m nltk.downloader punkt benepar_en3
python -m spacy download en_core_web_md

If I run:

import benepar, spacy
import nltk
benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_md')
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("This is gonna be fun.")
sent = list(doc.sents)[0]
print(sent._.parse_string)

It'll output:

(S (NP (DT This)) (VP (VBZ is) (VP (TO gon) (VP (TO na) (VP (VB be) (NP (NN fun)))))) (. .))

The issue is that the token gonna is split into two tokens gon and na. How can I prevent that?


Solution

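  • The split comes from spaCy's tokenizer, whose English rules include an exception that breaks gonna into gon and na; benepar parses whatever tokens spaCy produces. Override that exception with nlp.tokenizer.add_special_case before parsing: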

    import benepar, spacy
    from spacy.symbols import ORTH

    benepar.download('benepar_en3')
    nlp = spacy.load('en_core_web_md')
    # Keep "gonna" as one token, overriding the tokenizer's default split.
    nlp.tokenizer.add_special_case('gonna', [{ORTH: 'gonna'}])
    if spacy.__version__.startswith('2'):
        nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
    else:
        nlp.add_pipe("benepar", config={"model": "benepar_en3"})
    doc = nlp("This is gonna be fun.")
    sent = list(doc.sents)[0]
    print(sent._.parse_string)

    This is the output for the above code:

    (S (NP (DT This)) (VP (VBZ is) (VP (TO gonna) (VP (VB be) (NP (NN fun))))) (. .))
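
To check the tokenization on its own, without loading benepar or a trained model, something like the following sketch may help. It assumes a blank English pipeline (spacy.blank("en")) uses the same tokenizer rules as en_core_web_md, and tokenize_keeping_whole is a hypothetical helper name, not part of spaCy or benepar. Special cases match the exact string, so a capitalized variant such as Gonna likely needs its own entry.

```python
import spacy
from spacy.symbols import ORTH


def tokenize_keeping_whole(text, keep_whole):
    """Tokenize `text`, keeping each string in `keep_whole` as one token.

    Hypothetical helper for illustration only.
    """
    nlp = spacy.blank("en")  # tokenizer only; no model download needed
    for word in keep_whole:
        # Special cases match the exact string, so list every casing you need.
        nlp.tokenizer.add_special_case(word, [{ORTH: word}])
    return [token.text for token in nlp.tokenizer(text)]


# Default behaviour, for comparison.
default_tokens = [t.text for t in spacy.blank("en").tokenizer("This is gonna be fun.")]
# Same sentence with the special cases registered.
fixed_tokens = tokenize_keeping_whole("This is gonna be fun.", ["gonna", "Gonna"])

print(default_tokens)
print(fixed_tokens)
```

If the second list contains gonna as a single token, the same add_special_case call should carry over to the full pipeline, since benepar reads the tokens from the spaCy Doc.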