I have a dataset built by extracting data from external sources. The output looks like this:
Node Target
Jennifer Maria
Luke Mark
Johnny nan
Ludo Martin
Maria nan
Mark Luke
Mark Christopher
and so on
When I build a network using networkx, the Target field is null for some of my nodes, so I end up with isolated nodes even though they should be linked to a source node (e.g., Maria to Jennifer).
I am working with a directed network, but the problem would persist even if it were undirected: when I load the Node column as the node list, the nodes with a nan value in the Target column get linked to a node called nan.
My question is: is there a way to check whether each node in the Node column has at least one link, by looking at the Target column?
Happy to provide more information.
My expected output would be
Node Target
Jennifer Maria
Luke Mark
Johnny nan
Ludo Martin
Maria Jennifer
Mark Luke
Mark Christopher
This would let me create the network correctly.
(a) Find the rows where Target is NaN.
(b) Find the rows whose Target is one of the Node values from (a).
(c) Replace each NaN with the Node from (b) and update your original dataframe.
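import pandas as pd
# (a) rows whose Target is missing
a = df.loc[df['Target'].isnull()]
# (b) rows that point at one of those nodes, with the two columns swapped
b = df[df['Target'].isin(a['Node'])]
b = b.rename(columns={'Node': 'Target', 'Target': 'Node'})
# (c) look up a reverse link for each row of (a), then write it back into df
c = pd.merge(a['Node'], b, how='left', on='Node').set_index(a.index)
df.update(c)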
>>> a
Node Target
2 Johnny NaN
4 Maria NaN
>>> b
Target Node
0 Jennifer Maria
>>> c
Node Target
2 Johnny NaN
4 Maria Jennifer
>>> df
Node Target
0 Jennifer Maria
1 Luke Mark
2 Johnny NaN # <- NaN
3 Ludo Martin
4 Maria Jennifer # <- Jennifer
5 Mark Luke
6 Mark Christopher
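For anyone who wants to run the steps end to end, here is a minimal self-contained sketch. It rebuilds the sample dataframe from the question and uses a reverse lookup with map instead of the merge above. This is a swapped-in variant, not the code shown earlier, and it assumes that when several nodes point at the same NaN node, keeping the first one is acceptable. In that situation the merge version would probably fail, because the merge returns more rows than a and set_index(a.index) then has a length mismatch. The names reverse and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Node":   ["Jennifer", "Luke", "Johnny", "Ludo", "Maria", "Mark", "Mark"],
    "Target": ["Maria", "Mark", np.nan, "Martin", np.nan, "Luke", "Christopher"],
})

# Reverse lookup: for each Target value, the first Node that points at it.
# drop_duplicates keeps the lookup index unique (assumption: first match wins).
edges = df.dropna(subset=["Target"])
reverse = edges.drop_duplicates(subset="Target").set_index("Target")["Node"]

# Fill only the missing targets; nodes nobody points at (Johnny) stay NaN.
filled = df.copy()
filled["Target"] = filled["Target"].fillna(filled["Node"].map(reverse))

print(filled)
```

Working on a copy keeps the original df untouched, which makes it easier to compare the two before building the graph.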
Old Answer
As suggested by @AKX, remove the rows with NaN before building the graph:
import networkx as nx
edges = df[df.notna().all(axis=1)]  # keep only rows where both Node and Target are present
G = nx.from_pandas_edgelist(edges, source='Node', target='Target')
>>> G.edges
EdgeView([('Jennifer', 'Maria'), ('Luke', 'Mark'),
('Mark', 'Christopher'), ('Ludo', 'Martin')])
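Since the question mentions a directed network, here is a hedged sketch of the same idea with a DiGraph. It assumes you still want nodes such as Johnny to appear in the graph even though they have no edge, so they are added back after the NaN rows are dropped. The sample dataframe is rebuilt from the question and the variable isolated is illustrative.

```python
import numpy as np
import pandas as pd
import networkx as nx

# Sample data copied from the question.
df = pd.DataFrame({
    "Node":   ["Jennifer", "Luke", "Johnny", "Ludo", "Maria", "Mark", "Mark"],
    "Target": ["Maria", "Mark", np.nan, "Martin", np.nan, "Luke", "Christopher"],
})

# Drop rows without a target so that no node called nan is created.
edges = df.dropna(subset=["Target"])
G = nx.from_pandas_edgelist(
    edges, source="Node", target="Target", create_using=nx.DiGraph
)

# Add back nodes that only appeared in rows without a target.
G.add_nodes_from(df["Node"])

# Nodes with neither incoming nor outgoing edges.
isolated = list(nx.isolates(G))
print(isolated)
```

On this data Maria is not isolated, because the Jennifer -> Maria edge already reaches her; only Johnny has no link in either direction.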