My team is looking at migrating JanusGraph data between instances (we are using JanusGraph on top of Google Cloud Bigtable), using 2 separate approaches:

1. Export the data to a `graphml` file, and import it into the other instance (a rough sketch of this export step follows this list)
2. Export the underlying Bigtable table, and import it into the table underlying the other instance
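For context, the `graphml` export in approach 1 boils down to the standard TinkerPop `io()` call. Below is a minimal sketch of it, assuming an embedded JanusGraph connection; the properties file name and output path are placeholders, and in our setup the equivalent is submitted as a script to Gremlin Server, which is where the server-side timeout mentioned below comes in.

```java
import org.apache.tinkerpop.gremlin.structure.io.IoCore;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class GraphmlExport {
    public static void main(String[] args) throws Exception {
        // Open the source graph; the properties file name is a placeholder for
        // our JanusGraph-on-Bigtable configuration.
        JanusGraph graph = JanusGraphFactory.open("janusgraph-bigtable-source.properties");
        try {
            // Write the entire graph out as GraphML. On a graph of any real size
            // this runs for a long time, which is where the connection issue
            // described below shows up.
            graph.io(IoCore.graphml()).writeGraph("/tmp/janusgraph-export.graphml");
        } finally {
            graph.close();
        }
    }
}
```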
However, for each of the approaches, we are facing issues:

1. For the `graphml` export, we face a `java.io.IOException: Connection reset by peer` issue, even after setting the gremlin server timeout to beyond 20 mins
2. For the Bigtable approach, we exported and re-imported the table via Cloud Dataflow in 3 separate formats (as advised here), each with a different issue:
   - `Avro` format: after exporting the Avro files, re-importing them into the new table fails with the following error: `Error message from worker: java.io.IOException: At least 8 errors occurred writing to Bigtable. First 8 errors: Error mutating row ( ;�! with mutations [set cell ....] .... Caused by: java.lang.NullPointerException` - since JanusGraph stores binary data in Bigtable, perhaps the Dataflow job is unable to export the Avro files properly
   - `SequenceFile` format: re-importing these files fails with the following error: `Error message from worker: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 310 actions: StatusRuntimeException: 310 times, servers with issues: batch-bigtable.googleapis.com`
   - `Parquet` format: this proves to be the most promising, and the import job mostly completed (except for an error seen during the downscaling of Dataflow workers: `Root cause: The worker lost contact with the service.`). After re-importing into the target table, the data is generally intact. However, the indexes appear to be "cranky" after the import - e.g. when querying a particular node using a `has()` filter on an indexed property, the query completes quickly but does not return any results (see the query sketch after this list)
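To illustrate the index symptom, here is a minimal sketch of the kind of lookup that comes back empty on the migrated instance. The properties file name and the `name` property/value are placeholders; the same traversal returns the expected vertex on the source instance.

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class IndexedLookupCheck {
    public static void main(String[] args) throws Exception {
        // Open the target (migrated) graph; the properties file is a placeholder
        // for our JanusGraph-on-Bigtable configuration.
        JanusGraph graph = JanusGraphFactory.open("janusgraph-bigtable-target.properties");
        GraphTraversalSource g = graph.traversal();
        try {
            // "name" stands in for a property backed by an index in our schema.
            // After the Parquet import this completes quickly but prints an
            // empty list, even though the raw row data was imported.
            System.out.println(g.V().has("name", "some-known-value").valueMap().toList());
        } finally {
            graph.close();
        }
    }
}
```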
Would appreciate any opinions/inputs on the above issues, thanks!

---

So the problem here appears to be that Dataflow is failing mutation requests with more than 100k mutations per row (due to Bigtable's limitation). However, the newer version of the ParquetToBigtable template provided by Google appears to have a new parameter called `splitLargeRows`, which helps split up large rows so that the number of mutations stays <= 100k.
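For anyone hitting the same thing, below is a rough sketch of how the flag can be passed when launching the Google-provided template with `gcloud`. The template path, parameter names, and all IDs/paths are assumptions based on my reading of the public GCS_Parquet_to_Cloud_Bigtable template and should be verified against the current template documentation.

```sh
# Hypothetical invocation of the Parquet-to-Bigtable Dataflow template with
# large-row splitting enabled. Template path, parameter names, region, IDs and
# bucket paths below are placeholders - check the template docs before running.
gcloud dataflow jobs run parquet-to-bigtable-import \
  --region us-central1 \
  --gcs-location gs://dataflow-templates/latest/GCS_Parquet_to_Cloud_Bigtable \
  --parameters 'bigtableProjectId=my-project,bigtableInstanceId=my-instance,bigtableTableId=my-janusgraph-table,inputFilePattern=gs://my-bucket/parquet-export/*.parquet,splitLargeRows=true'
```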