I'm trying to run a Dataflow job that needs to connect to a Kafka cluster. The Kafka cluster requires a whitelist of IP addresses for security reasons, so I need to ensure that the Dataflow job always connects from the same IP address.
How can I achieve this? Is it possible to specify a single predictable IP address for a Dataflow job? If not, what are some alternative solutions to ensure that the job always connects from an IP address known to a firewall / 'whitelisted'?
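To 'force' Dataflow workers to use predictable IP addresses you need to set up several Google Cloud resources: a VPC network with a subnetwork in the region where the job will run, a Cloud Router with a Cloud NAT configuration on that network, and a static external IP address for the NAT to use.
Finally, when you launch your Dataflow job, you need to launch it in the subnetwork that you picked and disable the use of external IP addresses on the workers, so that all outbound traffic leaves through the NAT and therefore through the static address.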
In practice, the setup looks like this.
First create the network and a subnetwork in a particular region:
gcloud compute networks create $NETWORK_NAME \
--subnet-mode custom
gcloud compute networks subnets create $SUBNETWORK_NAME \
--network $NETWORK_NAME \
--region $REGION \
--range 192.168.1.0/28 # This range can be changed for a larger address space
# Also enable Private Google Access, so that workers without external IPs
# can still reach Google APIs and services:
# https://cloud.google.com/vpc/docs/configure-private-google-access#gcloud
gcloud compute networks subnets update $SUBNETWORK_NAME \
--region=$REGION \
--enable-private-ip-google-access
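As a rough sizing check for the subnetwork range: Google Cloud reserves 4 addresses in every subnet's primary IPv4 range, so a /28 leaves 12 addresses for workers. The helper below is a minimal sketch of that arithmetic; the function name `usable_addresses` is made up for illustration and is not part of gcloud.

```shell
# Hypothetical helper (not a gcloud command): number of addresses left for
# workers in a subnet with the given prefix length, after subtracting the
# 4 addresses Google Cloud reserves in each primary IPv4 range.
usable_addresses() {
  prefix=$1
  total=$(( 1 << (32 - prefix) ))
  echo $(( total - 4 ))
}

usable_addresses 28   # prints 12
usable_addresses 24   # prints 252
```

If the job may scale beyond that number of workers, pick a larger range when creating the subnetwork.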
Now create a router and a NAT rule for the router:
gcloud compute routers create $ROUTER_NAME \
--network $NETWORK_NAME \
--region $REGION
# Also reserve a static external IP address in the same region for the NAT to use
gcloud compute addresses create $ADDRESS_NAME \
--region=$REGION
gcloud compute routers nats create $ROUTER_CONFIG_NAME \
--router-region $REGION \
--router $ROUTER_NAME \
--nat-all-subnet-ip-ranges \
--nat-external-ip-pool=$ADDRESS_NAME
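The address to put on the Kafka whitelist is the one reserved under `$ADDRESS_NAME`. One way to look it up is sketched below, assuming the address was created in `$REGION` as in the commands above; the wrapper name `nat_ip` is made up for illustration.

```shell
# Hypothetical wrapper: print the external IP reserved under an address name.
# $1 = address name, $2 = region the address was reserved in.
nat_ip() {
  gcloud compute addresses describe "$1" \
    --region="$2" \
    --format='value(address)'
}

# Usage: nat_ip "$ADDRESS_NAME" "$REGION"
```

The printed value is the single source IP that the Kafka firewall should see for traffic from the Dataflow workers.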
This is all the networking setup you need. Now you can launch your Dataflow job.
When launching the Dataflow job, you need to pass the subnetwork
parameter and disable public IP addresses.
In Python:
python sample_pipeline.py \
--runner=DataflowRunner \
--project=$PROJECT \
--region=$REGION \
--subnetwork=regions/$REGION/subnetworks/$SUBNETWORK_NAME \
--no_use_public_ips
In Java:
mvn compile exec:java -Dexec.mainClass=my.main.pipeline.Klass \
-Dexec.args="--runner=DataflowRunner \
--project=$PROJECT \
--region=$REGION \
--subnetwork=regions/$REGION/subnetworks/$SUBNETWORK_NAME \
--usePublicIps=false" \
-Pdataflow-runner