I am reading an edge list from a file, and then from a different file I am reading an attribute called "date" which exists for some nodes. Note that some nodes in the edge list file do NOT have the "date" attribute.
However, when I re-read the same file a few lines later and check that every node which DOES have a "date" attribute still holds the value from its line, my assertions fail for some nodes. Why is this happening?
The check for whether the nodeID starts with "11" is specific to the dataset, and is irrelevant to the issue I am facing (or at least I think it is irrelevant?).
My code:
import networkx as nx
# Load edges
G = nx.read_edgelist("edges.txt", nodetype=int, create_using=nx.DiGraph)
# Verify load was right
print("Number of nodes:", G.order())
print("Number of edges:", G.size())
print()
num_dates = 0
num_useless = 0
with open("node-dates.txt") as f:
    for line in f:
        if line[0] == "#":
            continue
        nodeID = line.split()[0]
        if nodeID[:2]=="11":
            nodeID = int(nodeID[2:])
        if int(nodeID) in G:
            num_dates += 1
        else:
            num_useless += 1
            G.add_node(int(nodeID))
        G.nodes[int(nodeID)]["date"] = line.split()[1]
        # This below assert passes
        assert(G.nodes[int(nodeID)]["date"] == line.split()[1])
print("Number of nodes:", G.order())
print("Number of nodes with dates:", num_dates+num_useless)
print("Number of nodes with dates and edges:", num_dates)
print("Number of nodes with dates but no edges:", num_useless)
print("Number of nodes with edges but no dates:", G.order()-num_dates-num_useless)
count_failed_assert = 0
with open("node-dates.txt") as f:
    for line in f:
        if line[0] == "#":
            continue
        nodeID = line.split()[0]
        if nodeID[:2]=="11":
            nodeID = int(nodeID[2:])
        try:
            # But this assert fails?
            assert(G.nodes[int(nodeID)]["date"] == line.split()[1])
        except:
            count_failed_assert += 1
print(count_failed_assert)
The output:
Number of nodes: 34546
Number of edges: 421578
Number of nodes: 39951
Number of nodes with dates: 38557
Number of nodes with dates and edges: 33152
Number of nodes with dates but no edges: 5405
Number of nodes with edges but no dates: 1394
2446
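The issue was caused by the dataset having multiple dates for a single node. The assert in the first loop passes because it runs immediately after the assignment, but a later line for the same node overwrites the stored date, so in the second pass the earlier line no longer matches. I wasn't aware of the duplicates, which is why I was facing issues.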
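For anyone hitting the same thing, here is a minimal sketch of how the duplicates could be found before touching the graph. It uses only the standard library and a small made-up sample in place of node-dates.txt; the helper names `normalize_node_id` and `collect_dates` are hypothetical, and the "11" prefix rule is copied from my code above. Note that a node ID can end up with several dates either because the file repeats it or because stripping the prefix maps two raw IDs to the same integer.

```python
from collections import defaultdict


def normalize_node_id(raw_id):
    """Apply the rule from the question: drop a leading "11", then convert to int."""
    if raw_id[:2] == "11":
        raw_id = raw_id[2:]
    return int(raw_id)


def collect_dates(lines):
    """Map each normalized node ID to every date listed for it, in file order."""
    dates = defaultdict(list)
    for line in lines:
        if not line.strip() or line[0] == "#":
            continue  # skip blank lines and comment lines
        raw_id, date = line.split()[:2]
        dates[normalize_node_id(raw_id)].append(date)
    return dict(dates)


# Hypothetical sample standing in for node-dates.txt (not real data).
sample = [
    "# node date",
    "9203201 1992-02-24",
    "119203201 1993-05-10",  # same node after the "11" prefix is stripped
    "9203202 1992-03-01",
]

dates = collect_dates(sample)

# Nodes listed with more than one date; these are the ones whose earlier
# lines fail the second-pass assert once a later line overwrites the value.
duplicates = {node: ds for node, ds in dates.items() if len(ds) > 1}
print(duplicates)  # {9203201: ['1992-02-24', '1993-05-10']}
```

With the duplicates known, it is a separate decision whether to keep the first date, the last date, or the whole list as the node attribute; the sketch does not pick one.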