I am importing an existing dataset in MySql into a Gremlin-Server and I have a specific pattern of relationships stored in the MySQL database which I am struggling to import.
An example of the pattern is as follows: we have Students
, Teachers
, and Reports
. For simplicity sake, the schema would look like this:
+------------+--------------+
| student_id | student_name |
+------------+--------------+
| 1 | Albert |
+------------+--------------+
+------------+--------------+
| teacher_id | teacher_name |
+------------+--------------+
| 1 | Bartolo |
+------------+--------------+
+------------+--------------+----------------+------------------+
| report_id | report_name | requester_type | requester_forid |
+------------+--------------+----------------+------------------+
| 1 | grade_lookup | student | 1 |
+------------+--------------+----------------+------------------+
| 2 | attendees | teacher | 1 |
+------------+--------------+----------------+------------------+
With full understanding that this is poor MySQL best-practice (part of the reason for us doing this migration), I have been performing my data import. Currently it runs the pattern of:
foreach($reports as $report) {
$gremlin->send("
g.V().addV('Report')
.property('report_id', '" . $report['report_id'] . "')
.property('fortype','" . $report['requester_fortype'] . "')
.property('forid', '" . $report['requester_forId'] . "')
");
}
Do the same for each of the Teachers and Student tables.
Run a MySQL join on the requester information to find the linked values, grouped by class, from the "Reports" table.
$query = "
SELECT report_id, requester_type, requester_forid
FROM report
WHERE requester_type = 'student'
JOIN student ON requester_forid = requester_type;
";
$rows = mysqli_query($query);
foreach($rows as $row){
# Upsert Gremlin Edge between the objects by coalescing the vertices
$gremlin->send("
g.V().has('Report', 'report_id', '". $row['report_id'] ."').as('fromRecord').
V().has('Student', student_id, ".$row['requester_forid'].").
coalesce(__.inE('report_for_student').where(outV().as('fromRecord')),
addE('report_for_student').from('fromRecord')
");
}
This pattern works well for a small amount of records, where I would be able to have a set of small, well defined interactions. However, since the vertices for the report record would already carry their link information, I am wondering if there is a way to simplify this process. My ideal workflow would be as follows:
_type
, add an outgoing edge to the vertex which matches the label of the "type" field with the "id" propertyMy attempt here is
g.V().has('requester_type').values('requester_type').store('rt')
.sideEffect(V().has('rt')).addE('is_fortype').from('rt')
but this adds edge from all 'requester_type' to all, and is not doing the filtering I would need it to do. Is there some better way for me to frame this specific problem? I can run many upsert-edge functions in a loop using my framework, or I can insert edges where properties match properties, using the properties of each parameter to inform the joins and their direction.
For loading data, I generally advise that you get your source data into forms that don't put you in situations where you have to try to do "joins" on vertices to build edges. The cost of iterating V()
to build these joins is expensive for anything but the smallest graphs unless there is an appropriate index in place. Of course, indexing fields just for this purpose seems "wrong" as it creates indices that aren't of substantive value outside of the load. On some graphs it may even come at a cost in terms of resources and load times.
Reducing your load to "load all the vertices" and "load all the edges" in your first approach seemed fine to me. I'm not sure why you need to use coalesce()
if this is a one time bulk load from scratch. If there are somehow duplicates in your data that you are trying to eliminate then I would do that in your query of the source data and avoid the chance of it at all. Getting rid of the existing edge check should improve the speed of the load.
You could probably make edge loading more generalized and easy if you added a globally unique identifier for your vertices rather than trying to use integers for edge loading. Then the logic for edge loading is even more direct and doesn't require your code to reason about vertex labels when loading edges. You could also then pre-sort the data and then build your Gremlin to batch-load edges around a single vertex which would help avoid lookups of the same vertex over and over again.
Finally, I assume the PHP driver your using allows for parameterization of your Gremlin scripts. I highly recommend you adjust your code to utilize that feature. It will drastically improve performance.