
Generating graph-tool graph from Pandas DataFrame or CSV


I've started using graph-tool, hoping it would be a Python library that lets me analyze large graphs (~8M vertices, ~22M edges, stored in a Pandas DataFrame / CSV). The 'source' and 'target' columns are user ids for a certain digital service.

I started out with a toy example, following the method in this post.

import pandas as pd
from graph_tool.all import Graph

df = pd.DataFrame({'source': range(11, 15), 'target': range(12, 16)})

g = Graph(directed=True)

g.add_edge_list(df.values)

As you can see in my dummy example, there are only 5 distinct vertices (11, 12, 13, 14, 15). However, when I generate the graph, 16 vertices are created, seemingly filling the gap between 0 and the maximum node value.

g.get_vertices()

returns:

    array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15], dtype=uint64)

I assume that graph-tool 'reads' the values of the df as indices, not as the actual vertices' names. This follows from the docs:

Each vertex in a graph has an unique index, which is always between 0 and N-1, where N is the number of vertices.

How do I create a graph without these redundant vertices (which, if I import my real data, could number in the millions), and how can I keep my user ids as vertex names instead of having them treated as indices? I've been rummaging through the available methods and documentation and couldn't figure it out for the case of a bulk import from a DataFrame.

What else I tried:

import graph_tool

df.to_csv('test.csv', index=False)
g2 = graph_tool.load_graph_from_csv('test.csv', skip_first=True)

This does seem to create a graph with only 5 vertices, but 'loses' their names (user ids).

g2.get_vertices()

returns

array([0, 1, 2, 3, 4], dtype=uint64)

Instead of [11, 12, 13, 14, 15].

Appreciate your help! Thanks in advance.

I am using Python 2.7 on Jupyter/Anaconda.


Solution

  • What you want is enabled by the hashed parameter of the add_edge_list() method:

    vmap = g.add_edge_list(df.values, hashed=True)
    

    where vmap is a property map with the vertex "names".

    From the docstring:

    Optionally, if hashed == True, the vertex values in the edge list are not assumed to correspond to vertex indices directly. In this case they will be mapped to vertex indices according to the order in which they are encountered, and a vertex property map with the vertex values is returned. If string_vals == True, the algorithm assumes that the vertex values are strings. Otherwise, they will be assumed to be numeric if edge_list is a numpy.ndarray, or arbitrary python objects if it is not.

    Note that to guarantee efficient data structures, in graph-tool vertices are always contiguous integers, so they will always be numbered from 0 to N-1. If you want to give them different "names", you have to use property maps, as described in the documentation.
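To make the "order in which they are encountered" mapping concrete, here is a minimal sketch that reproduces the same idea with pandas alone, so it runs without graph-tool installed. It is only an illustration of the mapping, not graph-tool's actual implementation; the variable names (`flat`, `codes`, `names`, `edges`) are made up for this example, and it assumes the edge list is read row by row, source before target.

```python
import pandas as pd

# Toy edge list from the question
df = pd.DataFrame({'source': range(11, 15), 'target': range(12, 16)})

# Flatten row by row: source, target, source, target, ...
flat = df[['source', 'target']].values.ravel()

# factorize assigns codes 0..N-1 in order of first appearance
# and returns the distinct original values in that same order
codes, names = pd.factorize(flat)

# Edge list expressed with contiguous vertex indices
edges = codes.reshape(-1, 2)

print(edges.tolist())   # [[0, 1], [1, 2], [2, 3], [3, 4]]
print(names.tolist())   # [11, 12, 13, 14, 15]
```

Here `names[i]` plays the role that the returned `vmap` plays in graph-tool: it gives the user id for the vertex with index `i`. With `hashed=True` you should not need this step yourself, since `vmap[v]` already holds the original value for each vertex `v`.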