azure-databricks

Databricks Cluster terminated. Reason: Cloud Provider Launch Failure


I'm using Azure Databricks with a custom configuration that uses VNet injection, and I am unable to start a cluster in my workspace. The error message is not documented anywhere in the Microsoft or Databricks documentation, so I cannot diagnose why the cluster is not starting. I have reproduced the error message below:

Instance ID: [redacted]

Azure error message: 
Instance bootstrap failed.
Failure message: Cloud Provider Failure. Azure VM Extension stuck on transitioning state. Please try again later.
VM extension code: ProvisioningState/transitioning
instanceId: InstanceId([redacted])
workerEnv: workerenv-6662162805421143
Additional details (may be truncated): Enable in progress

Although it says "Please try again later", I have been retrying all day and getting the same message, which leads me to think the error message is not descriptive and something else is really going on.

Does anyone have ideas on what the issue could be?


Solution

  • This seems to be an issue with connectivity from the Databricks instance to the central Databricks servers. Our VNet injection settings were apparently not sufficient to route requests to the right place. Ultimately the problem was fixed by changing the Databricks workspace to use VNet peering (with its own custom VNet) instead of VNet injection. This way the workspace was able to communicate with our resources in the other VNet while still being able to start the cluster. A rough sketch of the peering setup follows this answer.

    This fulfilled our project requirements, but there may be cases where it's not sufficient for what a project requires. Hopefully the Azure Databricks team at least documents this issue to reduce confusion in the future.

    I also tried creating custom user-defined routes for Databricks, but that did not fix the issue.
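
For anyone trying to reproduce the fix, the generic Azure side of a two-way VNet peering can be scripted. The sketch below uses the Azure SDK for Python (azure-identity and azure-mgmt-network); every name in it (subscription ID, resource group, VNet names) is a hypothetical placeholder, not a value from this post. Also note that peering to the VNet a Databricks workspace actually uses is normally configured through the workspace's own Virtual Network Peering feature rather than a plain peering call, so treat this only as a minimal outline of the concept, not the exact steps taken here.

```python
# Minimal sketch: create a two-way VNet peering with the Azure SDK for Python.
# All names below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import VirtualNetworkPeering, SubResource

subscription_id = "<subscription-id>"    # hypothetical
resource_group = "my-resource-group"     # hypothetical
databricks_vnet = "databricks-vnet"      # VNet used by the Databricks workspace (hypothetical)
workload_vnet = "workload-vnet"          # VNet containing the resources to reach (hypothetical)

client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

def vnet_id(vnet_name: str) -> str:
    """Build the full resource ID of a VNet in the same subscription and resource group."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet_name}"
    )

def peer(local_vnet: str, remote_vnet: str, peering_name: str) -> None:
    """Create one direction of the peering; call twice for a two-way link."""
    client.virtual_network_peerings.begin_create_or_update(
        resource_group,
        local_vnet,
        peering_name,
        VirtualNetworkPeering(
            remote_virtual_network=SubResource(id=vnet_id(remote_vnet)),
            allow_virtual_network_access=True,
            allow_forwarded_traffic=True,
        ),
    ).result()

# Peering must exist in both directions before traffic flows.
peer(databricks_vnet, workload_vnet, "databricks-to-workload")
peer(workload_vnet, databricks_vnet, "workload-to-databricks")
```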