I'm trying to use this process:
https://cloud.google.com/bigtable/docs/exporting-sequence-files
to export my Bigtable table for backup. I've tried bigtable-beam-import versions 1.1.2 and 1.3.0 with no success. The program seems to kick off a Dataflow job properly, but no matter what settings I use, workers never seem to get allocated to the job. The logs always say:
Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently running step(s).
Then it hangs and workers never get allocated. If I let it run, the logs say:
2018-03-26 (18:15:03) Workflow failed. Causes: The Dataflow appears to be stuck. Workflow failed. Causes: The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
then it gets cancelled:
Cancel request is committed for workflow job...
I think I've tried changing all the possible pipeline options described here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
I've tried turning Autoscaling off and specifying the number of workers like this:
java -jar bigtable-beam-import-1.3.0-shaded.jar export \
--runner=DataflowRunner \
--project=mshn-preprod \
--bigtableInstanceId=[something] \
--bigtableTableId=[something] \
--destinationPath=gs://[something] \
--tempLocation=gs://[something] \
--maxNumWorkers=10 \
--zone=us-central1-c \
--bigtableMaxVersions=1 \
--numWorkers=10 \
--autoscalingAlgorithm=NONE \
--stagingLocation=gs://[something] \
--workerMachineType=n1-standard-4
I also tried specifying the worker machine type. Nothing changes: it always autoscales to 0 and fails. If there are people from the Dataflow team around, you can check out failed job ID: exportjob-danleng-0327001448-2d391b80.
Anyone else experience this?
After testing lots of changes to my GCloud project permissions, checking my quotas, etc., it turned out that my issue was with networking. This Stack Overflow question/answer was really helpful:
Dataflow appears to be stuck
It turns out that our team had created some networks/subnets in the gcloud project and removed the default network. When Dataflow tried to create VMs for the workers to run on, it failed because it couldn't do so in the "default" network.
There was no error in the Dataflow logs, just the one above about the Dataflow "being stuck." We ended up finding a helpful error message in the "Activity" stream on the GCloud home page. We then solved the problem by creating a VPC literally called "default", with subnets called "default" in all the regions. Dataflow was then able to allocate VMs properly.
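If you need to recreate that setup, something like the following should do it; this is a sketch rather than the exact command we ran, and it relies on the fact that an auto-mode VPC automatically gets a subnet named after the network in every region:
# Recreate a network literally named "default"; auto mode also creates
# a subnet named "default" (with a predefined IP range) in each region.
gcloud compute networks create default --subnet-mode=auto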
You should be able to pass network and subnet as pipeline parameters, but that didn't work for us with the Bigtable export script provided (link in the question). If you're writing Java code directly against the Dataflow API, though, you can probably fix the issue I had by setting the right network and subnet from your code.
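For reference, the relevant Dataflow runner options are --network and --subnetwork; appended to the export command above, they would look roughly like this (placeholder values, with the region matching the --zone you run in), even though they didn't take effect for us:
--network=[your-network] \
--subnetwork=regions/us-central1/subnetworks/[your-subnet]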
Hope this helps anyone who is dealing with the symptoms we saw.