Run spaCy's dependency parser on a pre-initialized Doc object


I am trying to incorporate spaCy's dependency parser into legacy Java code through a web API.

All other components (tokenizer, tagger, merged_words, NER) are handled by the legacy NLP code. I am only interested in applying the dependency parser, along with the dependency rule matcher, of spaCy 3.

I have tried the following approach:

  1. Creating a new Doc object using https://spacy.io/api/doc#init:

    from spacy.tokens import Doc

    # nlp is a pipeline that was loaded earlier (not shown here)
    sent = ["The heating_temperature was found to be 500 C"]
    words = ["The", "heating_temperature", "was", "found", "to", "be", "500", "C"]
    spaces = [True, True, True, True, True, True, True, False]
    tags = ["DT", "NN", "VBD", "VBN", "TO", "VB", "CD", "NN"]
    ents = ["O", "I-PARAMETER", "O", "O", "O", "O", "I-VALUE", "O"]
    doc = Doc(nlp.vocab, words=words, spaces=spaces, tags=tags, ents=ents)

  2. Creating an NLP pipeline with only the parser:

    import spacy

    # spacy.blank("en") could be used here too
    nlp2 = spacy.load("en_core_web_sm", exclude=["attribute_ruler", "lemmatizer", "ner", "parser", "tagger"])
    pipeWithParser = nlp2.add_pipe("parser", source=spacy.load("en_core_web_sm"))
    processed_dep = pipeWithParser(doc)  # see the similar example at https://spacy.io/api/tagger#call

However, I am getting a dependency tree (image not shown) in which every word is attached to the first word with an nmod relation.

What am I missing? I could use spaCy's tagger too if required. I tried including the tagger using a method similar to the one above, but all tags were labeled 'NN'.


Solution

  • The parser component in en_core_web_sm depends on the tok2vec component, so you need to run tok2vec on the doc before running the parser; otherwise the parser does not receive the input it expects.

    doc = nlp2.get_pipe("tok2vec")(doc)
    doc = nlp2.get_pipe("parser")(doc)