google-cloud-platform, google-compute-engine, apache-flink, google-cloud-dataproc, service-accounts

Flink job running on Dataproc not finding Google Application Default Credentials


According to the documentation (of which there is not much), when running an app on Google Compute Engine, the Google client libraries should automatically pick up the Application Default Credentials of the service account attached to the VMs.

I am currently running a Flink cluster on Dataproc (managed Hadoop), which runs on Google Compute Engine with VMs for the master and worker nodes. When I deploy the job through YARN, it fails because it cannot detect the Application Default Credentials.

Does anyone know whether Flink can automatically pick up the Application Default Credentials on the VMs? Do I need to configure anything, or is this simply not supported, meaning I have to specify the service account JSON in code manually?

Edit:

Some more information.

The Flink job is a streaming job (never-ending) that picks up records and inserts them into a Google BigQuery table and a Google Cloud Storage bucket. For this I am using the two client libraries listed below:

<dependency>
   <groupId>com.google.cloud</groupId>
   <artifactId>google-cloud-bigquery</artifactId>
   <version>1.65.0</version>
</dependency>
<dependency>
   <groupId>com.google.cloud</groupId>
   <artifactId>google-cloud-storage</artifactId>
   <version>1.65.0</version>
</dependency>
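
Both of these clients resolve Application Default Credentials implicitly when built from their default options, so no credentials appear anywhere in my code. A minimal sketch of what the sink calls look like (the dataset, table, and row content here are hypothetical placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.util.Collections;

public class SinkExample {
   public static void main(String[] args) {
      // Both clients pick up Application Default Credentials implicitly;
      // no explicit credentials are passed anywhere.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      Storage storage = StorageOptions.getDefaultInstance().getService();

      // Hypothetical dataset/table and row, just to show the call shape.
      InsertAllResponse response = bigquery.insertAll(
         InsertAllRequest.newBuilder(TableId.of("my_dataset", "my_table"))
            .addRow(Collections.<String, Object>singletonMap("field", "value"))
            .build());
      if (response.hasErrors()) {
         System.err.println("Insert errors: " + response.getInsertErrors());
      }
   }
}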

I added a GoogleCredentials.getApplicationDefault() call to the main function to verify that the credentials are being picked up, but it throws the following error:

The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials
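
For reference, the check is nothing more than this (GoogleCredentials comes from the google-auth-library, which the client libraries above pull in transitively):

import com.google.auth.oauth2.GoogleCredentials;

import java.io.IOException;

public class AdcCheck {
   public static void main(String[] args) {
      try {
         // Resolves ADC from (roughly) the GOOGLE_APPLICATION_CREDENTIALS
         // environment variable, a gcloud well-known file, or the GCE
         // metadata server, in that order.
         GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
         System.out.println("ADC resolved: " + credentials);
      } catch (IOException e) {
         System.err.println("ADC not available: " + e.getMessage());
      }
   }
}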

In addition, the logs contain the line Failed to detect whether we are running on Google Compute Engine, which leads me to believe the library cannot detect that it is running on the Compute Engine platform.

From some reading online, it seems a metadata server is used for this detection. We are running in a VPC, so I don't think the library is able to make that connection; a small probe like the one below seems to confirm the theory. Is this indeed the case, and if so, is there another approach I can use?
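
One way to test reachability of the metadata server from a node or pod is a plain HTTP probe; a minimal sketch, using the documented GCE conventions (the metadata.google.internal hostname and the Metadata-Flavor: Google header):

import java.net.HttpURLConnection;
import java.net.URL;

public class MetadataProbe {
   public static void main(String[] args) {
      try {
         // The GCE metadata server lives at this well-known hostname and
         // requires the Metadata-Flavor header on every request.
         URL url = new URL("http://metadata.google.internal/computeMetadata/v1/project/project-id");
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         conn.setRequestProperty("Metadata-Flavor", "Google");
         conn.setConnectTimeout(2000);
         conn.setReadTimeout(2000);
         System.out.println("Metadata server responded: HTTP " + conn.getResponseCode());
      } catch (Exception e) {
         // On a machine that cannot reach the metadata server (e.g. a pod
         // outside GCE, or blocked by network rules) this fails or times out.
         System.out.println("Metadata server unreachable: " + e);
      }
   }
}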


Solution

  • So this might not apply to everyone, but the issue was in my setup.

    I was using a Kubernetes pod to start a YARN session and used that to submit the job to the Flink cluster. Something to keep in mind with this approach is that the topology runs on the task managers, while the main function is called on the machine that starts the YARN session; in my case, that was the pod.

    Mounting the service account key into the pod and setting GOOGLE_APPLICATION_CREDENTIALS to point to that file fixed the issue. A sketch of the relevant pod spec pieces follows.
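
    For anyone with a similar setup, a minimal sketch of the pod spec (the pod, image, and Secret names are all hypothetical, and the key is assumed to be stored as key.json in a Kubernetes Secret):

    apiVersion: v1
    kind: Pod
    metadata:
      name: flink-yarn-client            # hypothetical pod name
    spec:
      containers:
        - name: flink-client
          image: my-flink-client:latest  # hypothetical image
          env:
            # Point ADC at the mounted key file.
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/gcp/key.json
          volumeMounts:
            - name: gcp-sa-key
              mountPath: /etc/gcp
              readOnly: true
      volumes:
        - name: gcp-sa-key
          secret:
            secretName: flink-gcp-sa     # hypothetical Secret containing key.json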