I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
id
, label
, from
and to
.id
and label
columns.from
column must equate to "the vertex ID of the from vertex."to
column must equate to "the vertex ID of the to vertex."My questions:
id
, label
, from
and to
? Or, should I add new columns?You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends), might look like this
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, so long as they are unique, any user provided string can be used as a node or edge ID. So in your example, if customer_no
is guaranteed to be unique, rather than store it as a property called customer_no
you could instead make it the ~id
. This can help later with efficient lookups. You can think of the ID as being a bit like a Primary Key in a relational database.
So in summary, you need to always provide the required fields like ~id
and ~label
. They are accessed differently using Gremlin steps such as hasLabel
and hasId
once the data is loaded. Columns with names from your domain like order_no
will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123')