Tags: python, beautifulsoup, networkx

NetworkX inconsistent graph membership when using Beautiful Soup Tags as nodes


I'm trying to build a graph of some tags in an HTML document. I'm using NetworkX to construct the graph, and since the docs state that "nodes can be any hashable object", I thought that Beautiful Soup Tags, which define the following __hash__ method, would be a reasonable choice for nodes.

def __hash__(self):
    return str(self).__hash__()

Unfortunately, this has resulted in some unexpected, non-deterministic behavior that I suspect is the result of some Tags being parents of others. I've reduced the example to the following

from bs4 import BeautifulSoup
import networkx as nx
import re

html = """
<html>
    <body>
        <div>1</div>
        <div>2</div>
        <div>3</div>
    </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
number_tags = soup.find_all(lambda tag: re.search(r'\d+', tag.text))

graph = nx.Graph()

for tag in number_tags:
    # calling extract prior to adding the node seems to be required to reproduce
    # but I see the same behavior with these two lines instead
    # print(tag.extract())
    # graph.add_node(tag)

    graph.add_node(tag.extract())

for node in graph:
    print(node in graph)

There are 5 tags matching the soup.find_all call: <html>, <body>, and the three <div>s. Since I'm looping through the nodes of the graph in the last two lines, I would expect it to always print True five times, but I've seen the following three outputs. Running the script three times in a row, I've seen it produce a different output on each run.

Case 1: False False True True True
Case 2: False True True True True
Case 3: True True True True True

I'd like to understand why this is happening and how to avoid this behavior. I've tried stepping through the NetworkX code, but under the hood it's just a call to __contains__ of the built-in dict so I suspect the issue may be on the Beautiful Soup side of things.

Environment:

  • python==3.10.10
  • networkx==3.1
  • beautifulsoup4==4.12.2

Solution

  • The issue is, as you correctly stated, how the hash is computed (here is a link to the official source code):

    ...
        def __hash__(self):
            return str(self).__hash__()
    ...
    

    When you run this code (I've added simple debug print):

    for tag in number_tags:
        # calling extract prior to adding the node seems to be required to reproduce
        # but I see the same behavior with these two lines instead
        # print(tag.extract())
        # graph.add_node(tag)
    
        tag = tag.extract()
        print(tag, hash(tag))
        graph.add_node(tag)
    

    You receive:

    <html>
    <body>
    <div>1</div>
    <div>2</div>
    <div>3</div>
    </body>
    </html> -1283923517493214265      # <--- 1. add
    <body>
    <div>1</div>
    <div>2</div>
    <div>3</div>
    </body> 4746304241092308513       # <--- 2. add
    <div>1</div> 1046019422797671110  # <--- 3. add
    <div>2</div> 6250487624992407587  # <--- 4. add
    <div>3</div> 2831357280475664491  # <--- 5. add
    

    What's added to the networkx graph is a Tag object. And, for example, the 2nd add (the <body> tag) will modify, via the .extract() call, the string representation of the tag from the 1st add (the <html> tag).

    So by the time the for-loop in which you add elements to the graph ends, some tags have a different string representation (and thus a different hash) than the one they had at insertion time, and the result is not deterministic.

    So when you check for elements in the graph:

    for node in graph:
        print(node, hash(node))
        print(node in graph)
    

    it prints (note that the first two tag string representations have changed):

    <html>
    
    </html> -5397864756596955648
    False
    <body>
    
    
    
    </body> -8497310738004871960
    False
    <div>1</div> 1046019422797671110    # <-- this hash id is the same (as string repr)
    True
    <div>2</div> 6250487624992407587    # <-- this hash id is the same (as string repr)
    True
    <div>3</div> 2831357280475664491    # <-- this hash id is the same (as string repr)
    True
    
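    The same failure mode can be reproduced without Beautiful Soup at all: any object whose __hash__ depends on mutable state silently "vanishes" from a dict or set once that state changes, because the container compares the hash stored at insertion time against a freshly computed one. A minimal stdlib sketch (MutableTag is a made-up stand-in for a bs4 Tag):

```python
class MutableTag:
    """Toy stand-in for a bs4 Tag: the hash depends on mutable state,
    just like Tag.__hash__ depends on str(self)."""
    def __init__(self, markup):
        self.markup = markup

    def __hash__(self):
        return hash(self.markup)

    def __eq__(self, other):
        return isinstance(other, MutableTag) and self.markup == other.markup


tag = MutableTag("<html><body></body></html>")
nodes = {tag}                   # networkx stores nodes in a plain dict

print(tag in nodes)             # True: lookup hash matches the stored hash

tag.markup = "<html></html>"    # mutation, like .extract() shrinking a parent
print(tag in nodes)             # False: the freshly computed hash no longer
                                # matches the hash stored at insertion time
```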

    Workaround

    The workaround is to make a standalone copy of the tag when you add it to the graph:

    graph.add_node(BeautifulSoup(str(tag), 'html.parser'))
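    Applied to the original script, the fixed loop could look like the sketch below (using html.parser for the top-level parse as well, so the sketch needs only beautifulsoup4 and networkx; the question's original parse used lxml). Each node is a standalone re-parse of the tag's markup, so its string representation, and therefore its hash, is frozen no matter what happens to the original tree afterwards:

```python
from bs4 import BeautifulSoup
import networkx as nx
import re

html = """
<html>
    <body>
        <div>1</div>
        <div>2</div>
        <div>3</div>
    </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
number_tags = soup.find_all(lambda tag: re.search(r'\d+', tag.text))

graph = nx.Graph()

for tag in number_tags:
    # Snapshot: re-parse the tag's current markup into a standalone soup,
    # disconnected from the original tree. No later .extract() (or any
    # other mutation) can change its str() and hence its hash.
    graph.add_node(BeautifulSoup(str(tag), 'html.parser'))

for node in graph:
    print(node in graph)    # True five times
```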