text nlp stanford-nlp opennlp information-extraction

NLP to find relationship between entities

My current understanding is that it's possible to extract entities from a text document using toolkits such as OpenNLP, Stanford NLP.

However, is there a way to find relationships between these entities?

For example consider the following text :

"As some of you may know, I spent last week at CERN, the European high-energy physics laboratory where the famous Higgs boson was discovered last July. Every time I go to CERN I feel a deep sense of reverence. Apart from quick visits over the years, I was there for three months in the late 1990s as a visiting scientist, doing work on early Universe physics, trying to figure out how to connect the Universe we see today with what may have happened in its infancy."

Entities: I (author), CERN, Higgs boson

Relationships : - I "visited" CERN - CERN "discovered" Higgs boson

Thanks.

Solution

You can extract verbs with their dependants using Stanford Parser, for example. E.g., you might get "dependency chains" like

"I :: spent :: at :: CERN".

It is a much tougher task to recognise that "I spent at CERN" and "I visited CERN" and "CERN hosted my visit" (etc) denote the same kind of event. Going into how this can be done is beyond the scope of an SO question, but you can read up literature of paraphrases recognition (here is one overview paper). There is also a related question on SO.

Once you can cluster similar chains, you'd need to find a way to label them. You could simply choose the verb of the most common chain in a cluster.

If, however, you have a pre-defined set of relation types you want to extract and lots of texts manually annotated for these relations, then the approach could be very different, e.g., using machine learning to learn how to recognize a relation type based on annotated data.