Read Pajek .net files using Graph-tool

I have a Pajek network file (undirected network with weighted edges), for which an example is provided here:

*Vertices 5
1  apple
2  cat
3  tree
4  nature
5  fire
*Edges
1  3  14
2  4  1

Node labels are provived without quoting. Edges are specified as node1, node2, edge weight.

I would need to read this file in graph-tool as an undirected graph with node labels and the "weight" attribute for edges. The function should also preserve isolate nodes.

Is there an efficient way to do this in Python? So far I have been reading the .net file with Networkx and then using a conversion function like this. I am looking for a way to speed up the process.

Solution

It appears that each section (Vertices/Edges) of the Pajek file can be interpreted as a space-delimited CSV file, which means you can parse it with pandas.read_csv(). That function is faster than the line-by-line parsing you suggested in your pure-python answer.

Also, it's faster to initialize the edge list and property lists all at once (as numpy arrays) rather than setting each element individually in a python loop.

I think the following implementation ought to be somewhat close to optimal, but I haven't benchmarked it.

import re
from io import StringIO

import numpy as np
import pandas as pd

import graph_tool as gt

def pajek_to_gt(path, directed=False, remove_loops=False):
    """
    Load a Pajek .NET file[1] as a graph_tool.Graph.
    Supports files which specify their edges via node pairs.
    Does not support files which specify their edges via the
    'edgeslist' scheme (i.e. the neighbors-list scheme).

    Note:
        Vertices are renumbered to start with 0, per graph-tool
        conventions (not Pajek conventions, which start with 1).

    Author: Stuart Berg (github.com/stuarteberg)
    License: MIT

    [1]: https://gephi.org/users/supported-graph-formats/pajek-net-format/
    """
    # Load into RAM
    with open(path, 'r') as f:
        full_text = f.read()

    if '*edgeslist' in full_text:
        raise RuntimeError("Neighbor list format not supported.")

    # Erase comment lines
    full_text = re.sub(r'^\s*%.*$', '', full_text, flags=re.MULTILINE)

    # Erase blank lines (including those created by erasing comments)
    full_text = re.sub(r'\n+', '\n', full_text)

    # Ensure delimiter is a single space
    full_text = re.sub(r'[ \t]+', ' ', full_text)

    num_vertices = int(StringIO(full_text).readline().split()[-1])

    # Split into vertex section and edges section
    # (Vertex section might be empty)
    vertex_text, edges_text = re.split(r'\*[^\n]+\n', full_text)[1:]

    # Parse vertices (if present)
    v_df = None
    if vertex_text:
        v_df = pd.read_csv(StringIO(vertex_text), delimiter=' ', engine='c', names=['id', 'label'], header=None)
        assert (v_df['id'] == np.arange(1, 1+num_vertices)).all(), \
            "File does not list all vertices, or lists them out of order."

    # Parse edges
    e_df = pd.read_csv(StringIO(edges_text), delimiter=' ', engine='c', header=None)
    if len(e_df.columns) == 2:
        e_df.columns = ['v1', 'v2']
    elif len(e_df.columns) == 3:
        e_df.columns = ['v1', 'v2', 'weight']
    else:
        raise RuntimeError("Can't understand edge list")

    e_df[['v1', 'v2']] -= 1

    # Free up some RAM
    del full_text, vertex_text, edges_text

    # Create graph
    g = gt.Graph(directed=directed)
    g.add_vertex(num_vertices)
    g.add_edge_list(e_df[['v1', 'v2']].values)

    # Add properties
    if 'weight' in e_df.columns:
        g.edge_properties["weight"] = g.new_edge_property("double", e_df['weight'].values)
    if v_df is not None:
        g.vertex_properties["label"] = g.new_vertex_property("string", v_df['label'].values)

    if remove_loops:
      gt.stats.remove_self_loops(g)

    return g

Here's what it returns for your example file:

In [1]: from pajek_to_gt import pajek_to_gt

In [2]: g = pajek_to_gt('pajek-example.NET')

In [3]: g.get_vertices()
Out[3]: array([0, 1, 2, 3, 4])

In [4]: g.vertex_properties['label'].get_2d_array([0])
Out[4]: array([['apple', 'cat', 'tree', 'nature', 'fire']], dtype='<U6')

In [5]: g.get_edges()
Out[5]:
array([[0, 2],
       [1, 3]])

In [6]: g.edge_properties['weight'].get_array()
Out[6]: PropertyArray([14.,  1.])

Note: This function does some preprocessing to convert double-spaces into single-spaces, since your example above uses double-spaces between entries. Was that intentional? The Pajek file specification you linked to uses single-spaces.

Edit:

Upon re-reading the Pajek file spec you linked to, I notice that there are two possible formats for the edges section. The second format lists each node's neighbors, in a variable-length list:

*edgeslist
4941 386 395 451
1 3553 3586 3587 3637
2 3583
3 4930
4 88
5 13 120

Obviously, my implementation above is not compatible with that format. I've edited the function to raise an exception if that format is used in the file.