I have a simple task: I have a bunch of files (~100 GB in total), where each line represents one entity. I have to send each entity to a JanusGraph server.
Job ID: 2018-07-07_05_10_46-8497016571919684639
After a while, I start getting OOM errors; the logs say that the Java process gets killed.
From the Dataflow view, I can see the following logs:
Workflow failed. Causes: S01:TextIO.Read/Read+ParDo(Anonymous)+ParDo(JanusVertexConsumer) failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
From the Stackdriver view, I can see: https://www.dropbox.com/s/zvny7qwhl7hbwyw/Screenshot%202018-07-08%2010.05.33.png?dl=0
The logs say:
E Out of memory: Kill process 1180 (java) score 1100 or sacrifice child
E Killed process 1180 (java) total-vm:4838044kB, anon-rss:383132kB, file-rss:0kB
More here: https://pastebin.com/raw/MftBwUxs
How can I debug what's going on?
There is too little information to debug the issue right now, so I am providing some general information about Dataflow.
In the Dataflow monitoring interface, click on the job name of interest -> upper right corner (errors + logs). If you are not able to fix the issue, please update the post with the error information.
UPDATE
Based on the deadline-exceeded error and the information you shared, I think your job is "shuffle-bound", which leads to memory exhaustion. According to this guide:
Consider one of, or a combination of, the following courses of action (a code sketch follows this list):
- Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
- Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
- Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects//zones//diskTypes/pd-ssd" when you run your pipeline.
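If you prefer to set these in code rather than on the command line, here is a minimal sketch assuming a Beam Java pipeline on the Dataflow runner; the worker count, disk size, and the <PROJECT>/<ZONE> placeholders are purely illustrative, not recommendations:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerTuningExample {
  public static void main(String[] args) {
    // Parse --numWorkers, --diskSizeGb, --workerDiskType, etc. from the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Or apply the same settings programmatically (illustrative values).
    options.setNumWorkers(20);
    options.setDiskSizeGb(250);
    options.setWorkerDiskType(
        "compute.googleapis.com/projects/<PROJECT>/zones/<ZONE>/diskTypes/pd-ssd");

    // ... build the pipeline with Pipeline.create(options) and run it ...
  }
}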
UPDATE 2
For specific OOM errors you can use:
--saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket>
This will cause the heap dump to be uploaded to the configured GCS path on the next worker restart, which makes it easy to download the dump file for inspection. Make sure that the account the job runs under has write permissions on the bucket. Please take into account that heap dump support has some overhead cost and that dumps can be very large; these flags should only be used for debugging purposes and should always be disabled for production jobs.
You can find other related methods in DataflowPipelineDebugOptions.
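As a sketch of how these options might be set in code, reusing the options object from the sketch above (flag and setter names assumed from the Beam Java SDK's DataflowPipelineDebugOptions; verify them against your SDK version):

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;

// Equivalent to passing --dumpHeapOnOOM=true --saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket>.
DataflowPipelineDebugOptions debugOptions = options.as(DataflowPipelineDebugOptions.class);
debugOptions.setDumpHeapOnOOM(true);                                    // write a heap dump when the JVM runs out of memory
debugOptions.setSaveHeapDumpsToGcsPath("gs://<path_to_a_gcs_bucket>");  // upload it to GCS on the next worker restart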
UPDATE 3
I did not find public documentation about this, but I tested that Dataflow scales the JVM heap size with the machine type (workerMachineType), which could also fix your issue. I am with GCP Support, so I filed two documentation requests (one for a description page and another one for a Dataflow troubleshooting page) to get this information added to the documentation.
On the other hand, there is this related feature request which you might find useful. Star it to make it more visible.