
How to connect a Dataflow job through a firewall? How to get a predictable IP address to connect to Kafka or an RDBMS?


I'm trying to run a Dataflow job that needs to connect to a Kafka cluster. The Kafka cluster requires a whitelist of IP addresses for security reasons, so I need to ensure that the Dataflow job always connects from the same IP address.

How can I achieve this? Is it possible to specify a single predictable IP address for a Dataflow job? If not, what are some alternative solutions to ensure that the job always connects from an IP address known to a firewall / 'whitelisted'?


Solution

  • To 'force' Dataflow workers to use predictable IP addresses, you need to set up several Google Cloud resources:

    • A VPC network
    • A subnetwork in a particular region
    • Optional: Private Google Access on the subnetwork - this lets VMs without external IP addresses (such as the Dataflow workers) still reach Google Cloud APIs and services.
    • A Cloud Router in the same region, to attach the NAT configuration to
    • A static external IP address that will act as the gateway - if you want multiple gateway IP addresses, reserve several of these.
    • A Cloud NAT configuration on the router that makes all subnets of the VPC network egress through the IP address(es) above

    Finally, when you launch your Dataflow job, you need to launch it in the subnetwork you picked and disable the use of external (public) IP addresses for the workers.

    In practice, this looks like this:

    Setting up the networking configuration

    First create the network and a subnetwork in a particular region:

    gcloud compute networks create $NETWORK_NAME \
        --subnet-mode custom
    
    gcloud compute networks subnets create $SUBNETWORK_NAME \
       --network $NETWORK_NAME \
       --region $REGION \
       --range 192.168.1.0/28   # This range can be changed for a larger address space
    
    # Also enable private Google Access:
    #  https://cloud.google.com/vpc/docs/configure-private-google-access#gcloud
    gcloud compute networks subnets update $SUBNETWORK_NAME \
        --region=$REGION \
        --enable-private-ip-google-access
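
    To double-check this step, you can describe the subnetwork (using the same $SUBNETWORK_NAME and $REGION variables as above) and confirm that Private Google Access is enabled:

    gcloud compute networks subnets describe $SUBNETWORK_NAME \
        --region=$REGION \
        --format='value(privateIpGoogleAccess)'   # prints True when enabled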
    

    Now create a router and a NAT rule for the router:

    gcloud compute routers create $ROUTER_NAME \
        --network $NETWORK_NAME \
        --region $REGION
    
    ## Also create a static IP address for this network
    gcloud compute addresses create $ADDRESS_NAME \
      --region=$REGION
    
    gcloud compute routers nats create $ROUTER_CONFIG_NAME \
        --router-region $REGION \
        --router $ROUTER_NAME \
        --nat-all-subnet-ip-ranges \
        --nat-external-ip-pool=$ADDRESS_NAME
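
    The static address you reserved above is the IP that the Kafka cluster (or any firewall) needs to allow. One way to look it up, using the same $ADDRESS_NAME and $REGION as before:

    gcloud compute addresses describe $ADDRESS_NAME \
        --region=$REGION \
        --format='value(address)'

    All worker traffic leaving the VPC network will appear to come from that address, so it is the one to whitelist.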
    

    This is all the networking setup you need. Now you can launch your Dataflow job.

    Launching the Dataflow job

    When launching the Dataflow job, you need to pass the subnetwork parameter and disable public IP addresses.

    In Python:

    python sample_pipeline.py \
      --runner=DataflowRunner \
      --project=$PROJECT \
      --region=$REGION \
      --subnetwork=regions/$REGION/subnetworks/$SUBNETWORK_NAME \
      --no_use_public_ips
    

    In Java:

    mvn compile exec:java -Dexec.mainClass=my.main.pipeline.Klass \
        -Dexec.args="--runner=DataflowRunner \
                     --project=$PROJECT \
                     --region=$REGION \
                     --subnetwork=regions/$REGION/subnetworks/$SUBNETWORK_NAME \
                     --usePublicIps=false" \
        -Pdataflow-runner
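
    Once the job is running, you can check that the worker VMs really have no external IP addresses. The name filter below is an assumption (Dataflow worker VM names normally start with the job name); an empty natIP column means no public address was assigned:

    gcloud compute instances list \
        --filter="name~'<your-job-name>'" \
        --format="table(name,networkInterfaces[0].networkIP,networkInterfaces[0].accessConfigs[0].natIP)"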