Search code examples
graphimportetlorientdbnosql

Easiest way to import a simple csv file to a graph with OrientDB ETL


I would like to import a very simple directed graph file in csv to OrientDB. Concretely, the file is the roadNet-PA dataset from the SNAP collection https://snap.stanford.edu/data/roadNet-PA.html. The first lines of the file are as follows:

# Directed graph (each unordered pair of nodes is saved once)
# Pennsylvania road network
# Nodes: 1088092 Edges: 3083796
# FromNodeId    ToNodeId
0       1
0       6309
0       6353
1       0
6353    0
6353    6354

There is only one type of vertex (a road intersection) and edges have no information (I suppose OrientDB lightweight edges are the best option for this). Note also that vertices are spaced with tabs.

I've tried to create a simple etl to import the file with no success. Here is the etl:

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roadNet-PA.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "   ", "skipFrom": 1, "skipTo": 4 } },
    { "vertex": { "class": "Intersection" } },
    { "edge": { "class": "Road" } }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 

The etl works but it does not import the file as I expect. I suppose the problem is in the transformers. My idea is to read the csv line by line and create and edge connecting both vertices, but I'm not sure how to express this in an etl file. Any ideas?


Solution

  • Try this:

    {
      "config": {
        "log": "debug"
      },
      "source" : {
        "file": { "path": "/tmp/roadNet-PA.csv" }
      },
      "extractor": { "row": {} },
      "transformers": [
        { "csv": { "separator": "\t", "skipFrom": 1, "skipTo": 4,
                   "columnsOnFirstLine": false, 
                   "columns":["id", "to"] } },
        { "vertex": { "class": "Intersection" } },
        { "merge": { "joinFieldName":"id", "lookup":"Intersection.id" } },
        { "edge": {
           "class": "Road",
           "joinFieldName": "to",
           "lookup": "Intersection.id",
           "unresolvedLinkAction": "CREATE"
          }
        },
      ],
      "loader": {
        "orientdb": {
           "dbURL": "remote:localhost/roads",
           "dbType": "graph",
           "wal": false,
           "batchCommit": 1000,
           "tx": true,
           "txUseLog": false,
           "useLightweightEdges" : true,
           "classes": [
             {"name": "Intersection", "extends": "V"},
             {"name": "Road", "extends": "E"}
           ], "indexes": [
             {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
           ]
        }
      }
    } 
    

    To speedup loading I suggest you to shutdown the server, and import the ETL by using "plocal:" instead of "remote:". Example replacing the existent with:

           "dbURL": "plocal:/orientdb/databases/roads",