I am trying to build a knowledge graph from textual documents (unstructured data). My current approach is to extract triples from the text and load them into a graph database, e.g. Neo4j, for further analysis. What I notice, however, is that triple extraction produces many of what I will call 'conditional triples'. An example:
text = "Donald Trump was president-elect for the republican party since July 2016"
This provides the following 'interesting' triples:
(Donald Trump, was, president-elect)
(Donald Trump, was president-elect for, republican party)
(Donald Trump, was president-elect for republican party since, July 2016)
We thus need 4 nodes:
1. Donald Trump
2. president-elect
3. republican party
4. July 2016
Those are the 4 nodes that might have interesting relations to other entities in the graph. My difficulty (or doubt), however, is with the relationships, which seem very specific and long.
I am not sure whether this is actually an issue, or whether it would be best practice to include long relationships such as 'was president-elect for republican party since'.
I have considered looking into creating traversals like:
(Donald Trump)-[was]->(president-elect)-[for]->(republican party)-[since]->(July 2016)
This provides 'simpler' relationships. However, either this is a unique traversal, so that other president-elects are not related to this particular node, or it is not unique, in which case other president-elects are related to the same node but the 'for' and 'since' relationships can no longer be traced uniquely to Donald Trump.
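To make the ambiguity concrete, here is a minimal in-memory sketch in Python (not Neo4j). The `edges` store, the helper names, and the second person, "Jane Doe" with "example party", are all made up for illustration. It shows what happens when the president-elect node is shared between people:

```python
from collections import defaultdict

# Toy edge store: (source node, relation) -> set of target nodes.
# Purely illustrative; not how a graph database stores data.
edges = defaultdict(set)

def add_chain(person, title, party, since):
    """Store (person)-[was]->(title)-[for]->(party)-[since]->(date),
    reusing the title and party nodes across people."""
    edges[(person, "was")].add(title)
    edges[(title, "for")].add(party)
    edges[(party, "since")].add(since)

add_chain("Donald Trump", "president-elect", "republican party", "July 2016")
# Hypothetical second person, only to show the collision.
add_chain("Jane Doe", "president-elect", "example party", "January 2020")

def parties_of(person):
    """Follow was -> for, starting from a person."""
    result = set()
    for title in edges[(person, "was")]:
        result |= edges[(title, "for")]
    return result

trump_parties = parties_of("Donald Trump")
print(sorted(trump_parties))  # ['example party', 'republican party']
```

Because both people point at the same president-elect node, the 'for' edge hanging off that node can no longer be attributed to one of them.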
As a result I am now inclined to apply the longer relationships. My question therefore is: Is that a best-practice approach, or am I missing alternative solutions?
Here is a possible data model:
(:Person {name:"Donald Trump"})-[:ACHIEVED {date:'2016-07-01'}]->(pos:Position)
(pos)-[:HAS_TITLE]->(:Title {name:"President Elect"})
(pos)-[:FOR_PARTY]->(:Party {name:"Republican"})
The Person, Title, and Party nodes are unique.
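As a rough sketch of how this model could be loaded, the Python snippet below builds one parameterized Cypher statement. It assumes that "unique" is enforced with `MERGE` and that a fresh `Position` node is created for each extracted fact. The function name `position_query` and the parameter names are made up for illustration, and nothing is sent to a database here:

```python
def position_query():
    """Return a Cypher statement for the model above.

    MERGE reuses an existing Person/Title/Party node or creates it once;
    CREATE adds a new Position node for this particular fact.
    """
    return "\n".join([
        "MERGE (p:Person {name: $person})",
        "MERGE (t:Title {name: $title})",
        "MERGE (party:Party {name: $party})",
        "CREATE (p)-[:ACHIEVED {date: $date}]->(pos:Position)",
        "CREATE (pos)-[:HAS_TITLE]->(t)",
        "CREATE (pos)-[:FOR_PARTY]->(party)",
    ])

# Parameters for the example sentence; the date format follows the model above.
params = {
    "person": "Donald Trump",
    "title": "President Elect",
    "party": "Republican",
    "date": "2016-07-01",
}

query = position_query()
print(query)
print(params)
```

Since each fact gets its own Position node, the date and party stay attached to one person, while other people holding the same title still meet at the shared Title node. The query and parameters would then be passed together to whatever Neo4j client you use.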