Search code examples
python-3.xgraph-tool

Read Pajek .net files using Graph-tool


I have a Pajek network file (undirected network with weighted edges), for which an example is provided here:

*Vertices 5
1  apple
2  cat
3  tree
4  nature
5  fire
*Edges
1  3  14
2  4  1

Node labels are provived without quoting. Edges are specified as node1, node2, edge weight.

I would need to read this file in graph-tool as an undirected graph with node labels and the "weight" attribute for edges. The function should also preserve isolate nodes.

Is there an efficient way to do this in Python? So far I have been reading the .net file with Networkx and then using a conversion function like this. I am looking for a way to speed up the process.


Solution

  • It appears that each section (Vertices/Edges) of the Pajek file can be interpreted as a space-delimited CSV file, which means you can parse it with pandas.read_csv(). That function is faster than the line-by-line parsing you suggested in your pure-python answer.

    Also, it's faster to initialize the edge list and property lists all at once (as numpy arrays) rather than setting each element individually in a python loop.

    I think the following implementation ought to be somewhat close to optimal, but I haven't benchmarked it.

    import re
    from io import StringIO
    
    import numpy as np
    import pandas as pd
    
    import graph_tool as gt
    
    def pajek_to_gt(path, directed=False, remove_loops=False):
        """
        Load a Pajek .NET file[1] as a graph_tool.Graph.
        Supports files which specify their edges via node pairs.
        Does not support files which specify their edges via the
        'edgeslist' scheme (i.e. the neighbors-list scheme).
    
        Note:
            Vertices are renumbered to start with 0, per graph-tool
            conventions (not Pajek conventions, which start with 1).
    
        Author: Stuart Berg (github.com/stuarteberg)
        License: MIT
    
        [1]: https://gephi.org/users/supported-graph-formats/pajek-net-format/
        """
        # Load into RAM
        with open(path, 'r') as f:
            full_text = f.read()
    
        if '*edgeslist' in full_text:
            raise RuntimeError("Neighbor list format not supported.")
    
        # Erase comment lines
        full_text = re.sub(r'^\s*%.*$', '', full_text, flags=re.MULTILINE)
    
        # Erase blank lines (including those created by erasing comments)
        full_text = re.sub(r'\n+', '\n', full_text)
    
        # Ensure delimiter is a single space
        full_text = re.sub(r'[ \t]+', ' ', full_text)
    
        num_vertices = int(StringIO(full_text).readline().split()[-1])
    
        # Split into vertex section and edges section
        # (Vertex section might be empty)
        vertex_text, edges_text = re.split(r'\*[^\n]+\n', full_text)[1:]
    
        # Parse vertices (if present)
        v_df = None
        if vertex_text:
            v_df = pd.read_csv(StringIO(vertex_text), delimiter=' ', engine='c', names=['id', 'label'], header=None)
            assert (v_df['id'] == np.arange(1, 1+num_vertices)).all(), \
                "File does not list all vertices, or lists them out of order."
    
        # Parse edges
        e_df = pd.read_csv(StringIO(edges_text), delimiter=' ', engine='c', header=None)
        if len(e_df.columns) == 2:
            e_df.columns = ['v1', 'v2']
        elif len(e_df.columns) == 3:
            e_df.columns = ['v1', 'v2', 'weight']
        else:
            raise RuntimeError("Can't understand edge list")
    
        e_df[['v1', 'v2']] -= 1
    
        # Free up some RAM
        del full_text, vertex_text, edges_text
    
        # Create graph
        g = gt.Graph(directed=directed)
        g.add_vertex(num_vertices)
        g.add_edge_list(e_df[['v1', 'v2']].values)
    
        # Add properties
        if 'weight' in e_df.columns:
            g.edge_properties["weight"] = g.new_edge_property("double", e_df['weight'].values)
        if v_df is not None:
            g.vertex_properties["label"] = g.new_vertex_property("string", v_df['label'].values)
    
        if remove_loops:
          gt.stats.remove_self_loops(g)
    
        return g
    

    Here's what it returns for your example file:

    In [1]: from pajek_to_gt import pajek_to_gt
    
    In [2]: g = pajek_to_gt('pajek-example.NET')
    
    In [3]: g.get_vertices()
    Out[3]: array([0, 1, 2, 3, 4])
    
    In [4]: g.vertex_properties['label'].get_2d_array([0])
    Out[4]: array([['apple', 'cat', 'tree', 'nature', 'fire']], dtype='<U6')
    
    In [5]: g.get_edges()
    Out[5]:
    array([[0, 2],
           [1, 3]])
    
    In [6]: g.edge_properties['weight'].get_array()
    Out[6]: PropertyArray([14.,  1.])
    

    Note: This function does some preprocessing to convert double-spaces into single-spaces, since your example above uses double-spaces between entries. Was that intentional? The Pajek file specification you linked to uses single-spaces.


    Edit:

    Upon re-reading the Pajek file spec you linked to, I notice that there are two possible formats for the edges section. The second format lists each node's neighbors, in a variable-length list:

    *edgeslist
    4941 386 395 451
    1 3553 3586 3587 3637
    2 3583
    3 4930
    4 88
    5 13 120
    

    Obviously, my implementation above is not compatible with that format. I've edited the function to raise an exception if that format is used in the file.