Tags: nlp, spacy, dependency-parsing

In spaCy NLP, how do I extract the agent, action, and patient, as well as cause/effect relations?


I would like to use spaCy to extract word-relation information in the form of "agent, action, and patient." For example, "Autonomous cars shift insurance liability toward manufacturers" -> ("autonomous cars", "shift", "liability") or ("autonomous cars", "shift", "liability toward manufacturers"). In other words, "who did what to whom" and "what applied the action to something else." I don't know much about my input data, so I can't make many assumptions.

I also want to extract logical relationships. For example, "Whenever/if the sun is in the sky, the bird flies" or cause/effect cases like "Heat makes ice cream melt."

For dependency parsing, spaCy recommends iterating through a sentence word by word to find the root, but I'm not sure what traversal pattern to use to extract the information reliably and in a form I can organize. My use case involves structuring these sentences so that I can use them for queries and logical inference, somewhat like my own mini Prolog data store.

For cause/effect, I could hard-code some rules, but then I still need a reliable way of traversing the dependency tree and extracting the information. (I will probably combine this with coreference resolution using neuralcoref, plus word vectors and ConceptNet to resolve ambiguities, but that's a little tangential.)

In short, the question is really about how to extract that information and how best to traverse the tree.

On a tangential note, I am wondering whether I also need a constituency tree for phrase-level parsing to achieve this. I think Stanford CoreNLP provides that, but spaCy might not.


Solution

  • To the first part of your question: it's pretty easy to use token.dep_ to identify the nsubj, ROOT, and dobj labels.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("She eats carrots")

    for t in doc:
        if t.dep_ == "nsubj":
            # the nominal subject of an active sentence is the agent
            print(f"The agent is {t.text}")
        elif t.dep_ == "dobj":
            # the direct object is the patient
            print(f"The patient is {t.text}")


    In passive constructions, the patient's dependency label is nsubjpass, but there may or may not be an agent - that's the point of passive voice.

    To get the words at the same level of the dependency parse, token.lefts, token.children, and token.rights are your friends. However, this won't catch things like "He is nuts!", since "nuts" isn't a direct object but an attribute. If you also want to catch that, look for the attr label.

    For the cause and effect stuff, before you decide on rules vs. a model, and on a library, just gather some data. Get 500 sentences and annotate them with the cause and effect. Then look at your data and see if you can pull it out with rules. There's also a middle ground: identify candidate sentences with rules (high recall, low precision), then use a model to actually extract the relationships. But you can't do it from first principles. Doing data science requires being familiar with your data.