Tags: python, gremlin, tinkerpop, tinkerpop3, janusgraph

Best way to get (millions of rows of) data into JanusGraph via TinkerPop, with a specific model


I've just started out with TinkerPop and JanusGraph, and I'm trying to figure this out based on the documentation.

  • I have three datasets (CSV files), each containing about 20 million rows
  • There is a specific model for how the variables and rows need to be connected, i.e. which become vertices, which become edges, what the labels are, etc.
  • Once everything is in a graph, I'd of course like to use some basic Gremlin to see how well the model works.

But first I need a way to get the data into Janusgraph.

Possibly there are existing scripts for this. But otherwise, is this something to be written in Python, i.e. open a CSV file, read each row of a variable X, and add it as a vertex/edge/etc.? Or am I completely misinterpreting JanusGraph/TinkerPop?

Thanks for any help in advance.

EDIT:

Say I have a few files, each of which contains a few million rows representing people, and several variables representing different metrics. A first example could look like this:

             metric_1    metric_2    metric_3    ..

person_1        a           e           i
person_2        b           f           j
person_3        c           g           k
person_4        d           h           l
..        

Should I translate this into files with nodes that initially consist of just the values [a, ..., l] (and later perhaps more elaborate sets of properties)?

And are [a,..., l] then indexed?

The 'Modern' graph here seems to have an index (numbers 1, ..., 12 for all the nodes and edges, independent of their overlapping labels/categories). Should each measurement likewise be indexed separately and then linked to the person_x it belongs to?

Apologies for these probably straightforward questions, but I'm fairly new to this.


Solution

  • JanusGraph uses pluggable storage backends and index backends. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a Gremlin Server, but we won't use it):

    cd /path/to/janus
    bin/janusgraph.sh start
    

    Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin Console:

    bin/gremlin.sh -e scripts/load_data.script 
    

    An efficient way to load the data is to split it into two files:

    • nodes.csv: one line per node with all attributes
    • links.csv: one line per link with source_id, target_id, and all the link's attributes

    This might require some data preparation steps.
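
    For the example data above, the two files could look like this (a hypothetical layout; the id column, the metric columns, and the 'knows' edge label are placeholders to adapt to your own model):

    nodes.csv:

    id,metric_1,metric_2,metric_3
    person_1,a,e,i
    person_2,b,f,j
    ..

    links.csv:

    source_id,target_id,label
    person_1,person_2,knows
    ..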

    The trick to speed up the process is to keep a mapping between your own ids and the ids created by JanusGraph during the creation of the nodes. Here is an example of what such a loading script (saved, for instance, as scripts/load_data.script as in the command above) could look like. It is a minimal sketch: the file names, column layout, property names, and properties file path are assumptions based on the hypothetical CSV files above, so adapt them to your data.
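
    // load_data.script -- a sketch, to be run with bin/gremlin.sh -e
    // File names, labels, and property names below are assumptions from the examples above.
    graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')  // pick the properties file matching your setup
    g = graph.traversal()

    // Mapping from the ids used in the CSV files to the vertex ids assigned by JanusGraph
    idMapping = [:]

    // nodes.csv: id,metric_1,metric_2,metric_3 (header row assumed)
    new File('nodes.csv').eachLine { line, lineNumber ->
        if (lineNumber == 1) return                       // skip the header
        def cols = line.split(',')
        def v = g.addV('person').
                  property('person_id', cols[0]).
                  property('metric_1', cols[1]).
                  property('metric_2', cols[2]).
                  property('metric_3', cols[3]).next()
        idMapping[cols[0]] = v.id()                       // remember the id JanusGraph assigned
        if (lineNumber % 10000 == 0) graph.tx().commit()  // commit in batches, not per row
    }
    graph.tx().commit()

    // links.csv: source_id,target_id,label (header row assumed)
    new File('links.csv').eachLine { line, lineNumber ->
        if (lineNumber == 1) return
        def cols = line.split(',')
        def source = g.V(idMapping[cols[0]]).next()
        def target = g.V(idMapping[cols[1]]).next()
        source.addEdge(cols[2], target)
        if (lineNumber % 10000 == 0) graph.tx().commit()
    }
    graph.tx().commit()
    graph.close()

    Looking vertices up by the id JanusGraph assigned (g.V(id)) is a direct access rather than an index query, which is what makes the mapping worthwhile when creating millions of edges.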

    Even if it is not mandatory, I strongly recommend creating an explicit schema for your graph before loading any data. Here is an example of what such a schema script could look like; it is again only a minimal sketch, with the labels, property keys, and index name taken from the hypothetical files above:
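
    // schema.script -- a sketch mirroring the hypothetical files above
    graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')
    mgmt = graph.openManagement()

    // Vertex and edge labels
    mgmt.makeVertexLabel('person').make()
    mgmt.makeEdgeLabel('knows').make()

    // Property keys
    personId = mgmt.makePropertyKey('person_id').dataType(String.class).make()
    mgmt.makePropertyKey('metric_1').dataType(String.class).make()
    mgmt.makePropertyKey('metric_2').dataType(String.class).make()
    mgmt.makePropertyKey('metric_3').dataType(String.class).make()

    // Composite index so lookups by person_id do not require a full graph scan
    mgmt.buildIndex('byPersonId', Vertex.class).addKey(personId).unique().buildCompositeIndex()

    mgmt.commit()
    graph.close()

    Run this once, also via bin/gremlin.sh -e, before executing the loading script.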