
Google Dataflow "Workflow failed" with no reason


I am running Dataflow jobs on Google Cloud Platform, and a new error I am getting is "Workflow failed" with no explanation. These are the logs I get:

 2017-08-25 (00:06:01) Executing operation ReadNewXXXFromStorage/Read+JsonStringsToXXX+RemoveLanguagesFromXXX...
 2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/GroupByKey/Create
 2017-08-25 (00:06:01) Starting 1 workers in europe-west1-b...
 2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/ParDo(SplitQuery)+ReadOldXYZ...
 2017-08-25 (00:06:48) Workflow failed.
 2017-08-25 (00:06:48) Stopping worker pool...
 2017-08-25 (00:06:58) Worker pool stopped.

How am I supposed to find out what is going wrong? It should not be a permissions problem on the object, since similar jobs run successfully. When I try to rerun the template from the Google Cloud Console, I get the message:

No metadata file found for this template

But I am able to start the template, and now it runs successfully. Could this be related to exceeded quotas? We just increased our CPU and IP quotas for Dataflow, and I raised our number of parallel running jobs from 5 to 15 to make use of the quota. When I rerun the template with no other jobs running, everything seems to work fine.

Any input is highly appreciated. Thanks.

EDIT: It seems the jobs failed because the CPU quota was exceeded, although usually we would get an error description saying "could not spawn enough workers". In any case, everything works fine after I reduced the maximum number of workers per job so that our quota cannot be exceeded.
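
The arithmetic behind that fix can be sketched as follows. This is only an illustration: the function names and the quota, job count, and vCPUs-per-worker figures are hypothetical, and it assumes every job may scale to its worker cap at the same time. The resulting cap is what you would pass as the maximum-workers pipeline option (`--maxNumWorkers` in the Java SDK, `--max_num_workers` in the Python SDK).

```python
def max_workers_per_job(cpu_quota, concurrent_jobs, cpus_per_worker=1):
    """Largest per-job worker cap that keeps all jobs within the CPU quota.

    Assumes the worst case: every concurrent job scales up to its cap
    at the same moment, and every worker uses the same machine type.
    """
    if cpu_quota < 0 or concurrent_jobs <= 0 or cpus_per_worker <= 0:
        raise ValueError("quota must be >= 0; jobs and CPUs per worker must be > 0")
    return cpu_quota // (concurrent_jobs * cpus_per_worker)


def fits_quota(cpu_quota, concurrent_jobs, workers_per_job, cpus_per_worker=1):
    """True if all jobs running at their worker cap stay within the CPU quota."""
    return concurrent_jobs * workers_per_job * cpus_per_worker <= cpu_quota


if __name__ == "__main__":
    # Hypothetical numbers: 120 vCPUs of regional quota, 15 parallel jobs,
    # workers with 4 vCPUs each.
    cap = max_workers_per_job(cpu_quota=120, concurrent_jobs=15, cpus_per_worker=4)
    print("max workers per job:", cap)
    print("fits with that cap:", fits_quota(120, 15, cap, 4))
    print("fits with 3 workers:", fits_quota(120, 15, 3, 4))
```

If the computed cap is 0, the quota cannot support that many parallel jobs with the chosen machine type at all, so either the number of concurrent jobs or the machine size has to come down.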


Solution

  • I believe the "No metadata file found for this template" message should be considered a warning, not an error. A template can have a "metadata" file associated with it, which allows its parameters to be validated. If no such file is present, the parameters aren't validated, but everything else works as normal; the message simply indicates this situation.

    It sounds like the job was unable to run for other reasons. Based on your description and the edit, it sounds like this was because there was not enough quota to run the job.
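
    For reference, here is a minimal sketch of generating the metadata file mentioned above. The template name, parameter name, and regex are hypothetical, and the field names (`name`, `description`, `parameters`, `label`, `helpText`, `regexes`, `isOptional`) follow my understanding of the classic-template documentation, so check them against the current docs. As far as I know, the file is expected to sit in the same Cloud Storage folder as the template and be named `<template-name>_metadata`.

```python
import json


def build_template_metadata():
    """Build a metadata document describing a template's parameters.

    All names and values below are hypothetical placeholders.
    """
    return {
        "name": "ExampleImportTemplate",
        "description": "Reads JSON strings from Cloud Storage and processes them.",
        "parameters": [
            {
                "name": "inputFile",  # hypothetical pipeline option name
                "label": "Input Cloud Storage file",
                "helpText": "Path of the file to read, e.g. gs://my-bucket/input.json",
                "regexes": ["^gs://[^\\n\\r]+$"],  # value must look like a gs:// path
                "isOptional": False,
            }
        ],
    }


if __name__ == "__main__":
    metadata = build_template_metadata()
    # Write the file locally; it would then be uploaded next to the template
    # as <template-name>_metadata.
    with open("ExampleImportTemplate_metadata", "w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2)
    print(json.dumps(metadata, indent=2))
```

    With such a file in place, the console can validate parameter values before launching, and the "No metadata file found" message should no longer appear.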