networkx dataset of multiple graphs in json format for GraphSage algorithm


I am trying to create a dataset of multiple networkx graphs in JSON format. My goal is to use this dataset, with labels, for supervised training of my network. Networkx can dump a single graph to JSON, but I am not sure how to handle multiple graphs in the same JSON file.

The GraphSage documentation claims that the example at https://github.com/williamleif/GraphSAGE/tree/master/example_data has multiple graphs.

The example_data subdirectory contains a small example of the protein-protein interaction data, which includes 3 training graphs + one validation graph and one test graph.

However, when I load the example file toy-ppi-G.json in Python, I cannot tell the different graphs apart; it looks as if there is just a single graph. The loaded JSON has the following keys:

import json

with open('toy-ppi-G.json') as f:
    data = json.load(f)

data.keys()

# result:
dict_keys(['directed', 'graph', 'nodes', 'links', 'multigraph'])

My goal is to understand the JSON format for multiple graphs so that I can create my own datasets and use them for training.


Solution

  • The overall idea is that you can represent multiple graphs as disjoint components of one big graph. That's what they do in the GraphSAGE repository and you can do it as well.

    You can store your multiple graphs in the same json file: as long as there are no edges between two distinct graphs, the GNN will "see" them as separate as well.

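    I explain this in the documentation of my library for GNNs, with a figure of the adjacency and node feature matrices in which colors mark the distinct graphs. When nodes are numbered sequentially graph by graph, the adjacency matrix is block-diagonal, with one block per graph.

    (Figure: adjacency and node feature matrices, colored by graph.)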

    So, to answer your question: number your nodes sequentially across all graphs and add them to the JSON file, then use those same indices when you add the edges.
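
The sequential-numbering idea can be sketched as below. This is only a minimal illustration, not the GraphSAGE repository's own preprocessing code: the helper names `merge_graphs` and `split_graphs` and the `graph_id` node attribute are hypothetical, and the dictionary is built by hand so that it has the same five keys the question lists (`directed`, `graph`, `nodes`, `links`, `multigraph`). It assumes undirected graphs whose node attributes are JSON-serializable.

```python
import json
import networkx as nx


def merge_graphs(graphs):
    """Merge several graphs into one node-link dict with sequential node ids.

    Hypothetical helper: 'graph_id' is an illustrative attribute, not something
    the GraphSAGE format is known to require.
    """
    nodes, links = [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        # Map each original node label to a globally unique integer id.
        local_to_global = {n: offset + i for i, n in enumerate(g.nodes())}
        for n, attrs in g.nodes(data=True):
            nodes.append({**attrs, "id": local_to_global[n], "graph_id": graph_id})
        # Edges only ever connect nodes of the same source graph.
        for u, v in g.edges():
            links.append({"source": local_to_global[u], "target": local_to_global[v]})
        offset += g.number_of_nodes()
    return {"directed": False, "multigraph": False, "graph": {},
            "nodes": nodes, "links": links}


def split_graphs(data):
    """Recover the separate graphs as connected components of the merged graph.

    Caveat: a source graph that is itself disconnected comes back as several
    pieces; group nodes by 'graph_id' instead if that matters.
    """
    big = nx.Graph()
    big.add_nodes_from(
        (n["id"], {k: v for k, v in n.items() if k != "id"}) for n in data["nodes"]
    )
    big.add_edges_from((e["source"], e["target"]) for e in data["links"])
    return [big.subgraph(c).copy() for c in nx.connected_components(big)]


# Two toy graphs: a 3-node path and a 4-node cycle.
data = merge_graphs([nx.path_graph(3), nx.cycle_graph(4)])
text = json.dumps(data)            # this string is what would go into the JSON file
parts = split_graphs(json.loads(text))
```

Because no edge crosses from one source graph to another, loading the file and taking connected components gives the original graphs back. For actual GraphSAGE training, check the repository README for the extra node attributes and companion files it expects; the sketch covers only the graph structure.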