I have a simple task: I have a bunch of files (~100 GB in total), where each line represents one entity. I have to send each entity to a JanusGraph server.
Job ID: 2018-07-07_05_10_46-8497016571919684639
After a while, I start getting OOM errors; the logs say that the Java process gets killed.
From the Dataflow view, I can see the following logs:
Workflow failed. Causes: S01:TextIO.Read/Read+ParDo(Anonymous)+ParDo(JanusVertexConsumer) failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
From the Stackdriver view, I can see: https://www.dropbox.com/s/zvny7qwhl7hbwyw/Screenshot%202018-07-08%2010.05.33.png?dl=0
The logs say:
E Out of memory: Kill process 1180 (java) score 1100 or sacrifice child
E Killed process 1180 (java) total-vm:4838044kB, anon-rss:383132kB, file-rss:0kB
More here: https://pastebin.com/raw/MftBwUxs
How can I debug what's going on?
There is too little information to debug the issue right now, so I am providing some general information about Dataflow.
In the Dataflow monitoring interface, click on the job name of interest -> upper right corner (errors + logs). If you are not able to fix the issue, please update the post with the error information.
UPDATE
Based on the deadline-exceeded error and the information you shared, I think your job is "shuffle-bound", which leads to memory exhaustion. According to this guide:
Consider one of, or a combination of, the following courses of action (a code sketch follows this list):
- Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects//zones//diskTypes/pd-ssd" when you run your pipeline.
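If you prefer to set these in code rather than on the command line, here is a minimal sketch assuming a Beam Java pipeline on the Dataflow runner; the worker count, disk size, and the <PROJECT>/<ZONE> placeholders are purely illustrative, not recommendations:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerTuningExample {
  public static void main(String[] args) {
    // Parse --numWorkers, --diskSizeGb, --workerDiskType, etc. from the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Or apply the same settings programmatically (illustrative values).
    options.setNumWorkers(20);
    options.setDiskSizeGb(250);
    options.setWorkerDiskType(
        "compute.googleapis.com/projects/<PROJECT>/zones/<ZONE>/diskTypes/pd-ssd");

    // ... build the pipeline with Pipeline.create(options) and run it ...
  }
}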
UPDATE 2
For specific OOM errors you can use:
--saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket>
This will cause the heap dump to be uploaded to the configured GCS path on the next worker restart, which makes it easy to download the dump file for inspection. Make sure that the account the job runs under has write permissions on the bucket. Please take into account that heap dump support has some overhead cost and that dumps can be very large; these flags should only be used for debugging purposes and should always be disabled for production jobs.
You can find other related methods in DataflowPipelineDebugOptions.
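As a sketch of how these options might be set in code, reusing the options object from the sketch above (flag and setter names assumed from the Beam Java SDK's DataflowPipelineDebugOptions; verify them against your SDK version):

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;

// Equivalent to passing --dumpHeapOnOOM=true --saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket>.
DataflowPipelineDebugOptions debugOptions = options.as(DataflowPipelineDebugOptions.class);
debugOptions.setDumpHeapOnOOM(true);                                    // write a heap dump when the JVM runs out of memory
debugOptions.setSaveHeapDumpsToGcsPath("gs://<path_to_a_gcs_bucket>");  // upload it to GCS on the next worker restart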
UPDATE 3
I did not find public documentation about this, but I tested that Dataflow scales the JVM heap size with the machine type (workerMachineType), which could also fix your issue. I am with GCP Support, so I filed two documentation requests (one for a description page and another one for a Dataflow troubleshooting page) to get this information added to the documentation.
On the other hand, there is this related feature request which you might find useful. Star it to make it more visible.