Tags: python, cypher, lxml, hierarchical-data, agens-graph

Save DOM tree into a graph database: Connect related nodes


I'm inserting hierarchical data from a DOM tree into a graph database, but I can't obtain the parent's ID, which I need in order to create a relationship between a child node and its parent.

The code below traverses the DOM nodes, inserts each tag, and retrieves the last inserted ID. I need the IDs of both the child and its parent in order to create the relationship between them.

from lxml import html
import age  # from AgensGraph
from age.gen.ageParser import *

GRAPH_NAME = "demo_graph"
DSN = "host=localhost port=5432 dbname=demodb user=userdemo password=demo234"

ag = age.connect(graph=GRAPH_NAME, dsn=DSN)
tree = html.parse("demo.html")
for element in tree.getiterator():
    if parent := element.getparent():        
        parent = None
        cursor = ag.execCypher("CREATE (t:node {name: %s} ) RETURN t", params=(element.tag))        
        b = [x[0].id for x in cursor]  # get last inserted ID 
        print(b[0])        
        ag.execCypher("MATCH (c:node), (p:node) WHERE c.id = %s AND p.id = %s CREATE (a)-[r:connects}]->(b)") # Match child node 'c', parent node: p and join  C Connects P (P is unknown)

Here is the demo file: demo.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="UTF-8"/>
    <title>Document</title>
  </head>
  <body>
    <ul class="menu">
      <div class="itm">home</div>
      <div class="itm">About us</div>
      <div class="itm">Contact us</div>
    </ul>
    <div id="idone" class="classone">
      <li class="item1">First</li>
      <li class="item2">Second</li>
      <li class="item3">Third</li>
      <div id="innerone"><h1>This Title</h1></div>
      <div id="innertwo"><h2>Subheads</h2></div>      
    </div>
    <div id="second" class="below">
      <div class="inner">
        <h1>welcome</h1>
        <h1>another</h1>
        <h2>third</h2>
      </div>
    </div>
  </body>
</html>

Here is the extracted DOM Tree:

tag: head attrib: None parent: html
tag: meta attrib: ('charset', 'UTF-8') parent: head
tag: title attrib: None parent: head
tag: body attrib: None parent: html
tag: h1 attrib: None parent: div
tag: h1 attrib: None parent: div
tag: h2 attrib: None parent: div
/tmp/ipykernel_27254/2858024143.py:4: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
  if parent := element.getparent():

Solution

  • A CREATE statement is not persisted until the session is committed, so call ag.commit() after execCypher(...):

    # params must be a tuple, hence the trailing comma
    cursor = ag.execCypher("CREATE (t:node {name: %s}) RETURN t", params=(element.tag,))
    b = [x[0].id for x in cursor]  # ID of the vertex just inserted
    ag.commit()
    

    Try the following code:

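    ag = age.connect(graph=GRAPH_NAME, dsn=DSN)
    tree = html.parse("demo.html")
    ids = {}  # lxml element -> ID of the vertex created for it
    for element in tree.iter():  # document order: a parent is visited before its children
        cursor = ag.execCypher("CREATE (t:node {name: %s}) RETURN t", params=(element.tag,))
        b = [x[0].id for x in cursor]  # get last inserted ID
        ag.commit()
        ids[element] = b[0]
        print(b[0])
        parent = element.getparent()
        if parent is not None:  # the root <html> element has no parent
            # The parent was inserted earlier, so its ID is already in ids.
            # Match child 'c' and parent 'p' by ID, then create c -[connects]-> p.
            ag.execCypher("MATCH (c:node), (p:node) WHERE id(c) = %s AND id(p) = %s CREATE (c)-[r:connects]->(p)", params=(ids[element], ids[parent]))
            ag.commit()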
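
The ID bookkeeping behind this can be tried without a database. The sketch below is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree in place of lxml, a counter stands in for the IDs the database would return, and DOC is a hypothetical, trimmed-down version of demo.html. Because ElementTree has no getparent(), the parent of each element is looked up in a map built beforehand.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for demo.html (must be well-formed XML here).
DOC = """
<html>
  <head><title>Document</title></head>
  <body>
    <div id="second"><h1>welcome</h1><h2>third</h2></div>
  </body>
</html>
"""


def build_graph(root):
    """Return (nodes, edges): nodes maps a fake ID to a tag name,
    edges is a list of (child_id, parent_id) pairs."""
    next_id = itertools.count(1)  # stands in for the IDs returned by the database
    ids = {}                      # element -> ID assigned when it was "inserted"
    nodes, edges = {}, []
    # ElementTree has no getparent(), so derive every element's parent up front.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for element in root.iter():   # document order: parents come before children
        ids[element] = next(next_id)
        nodes[ids[element]] = element.tag
        parent = parent_of.get(element)
        if parent is not None:    # the root has no parent
            edges.append((ids[element], ids[parent]))
    return nodes, edges


nodes, edges = build_graph(ET.fromstring(DOC))
for child_id, parent_id in edges:
    print(f"{nodes[child_id]} -[connects]-> {nodes[parent_id]}")
```

Run as is, this prints one line per child element, starting with `head -[connects]-> html`. Note the explicit `is not None` test: as the FutureWarning quoted in the question says, the truth value of an lxml element reflects whether it has children, so `if parent := element.getparent():` is not a reliable way to check that a parent exists.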