
How to connect a Dataflow job through a firewall? How to get a predictable IP address to connect to Kafka or an RDBMS?


I'm trying to run a Dataflow job that needs to connect to a Kafka cluster. The Kafka cluster requires a whitelist of IP addresses for security reasons, so I need to ensure that the Dataflow job always connects from the same IP address.

How can I achieve this? Is it possible to specify a single predictable IP address for a Dataflow job? If not, what are some alternative solutions to ensure that the job always connects from an IP address known to a firewall / 'whitelisted'?


Solution

  • To 'force' Dataflow workers to use predictable IP addresses, you need to set up several Google Cloud resources:

    • A VPC network
    • A subnetwork in a particular region
    • Optional: Private Google Access on the subnetwork - this lets VMs without external IP addresses (such as the Dataflow workers) still reach Google Cloud APIs and services.
    • A Cloud Router in the same region, to attach the NAT configuration to
    • A static external IP address that will act as the gateway - if you want multiple gateway IP addresses, reserve several of these.
    • A Cloud NAT configuration on the router that makes all subnets of the VPC network egress through the IP address(es) above

    Finally, when you launch your Dataflow job, you need to launch it in the subnetwork you picked and disable the use of external (public) IP addresses for the workers.

    In practice, this looks like this:

    Setting up the networking configuration

    First create the network and a subnetwork in a particular region:

    gcloud compute networks create $NETWORK_NAME \
        --subnet-mode custom
    
    gcloud compute networks subnets create $SUBNETWORK_NAME \
       --network $NETWORK_NAME \
       --region $REGION \
       --range 192.168.1.0/28   # This range can be changed for a larger address space
    
    # Also enable private Google Access:
    #  https://cloud.google.com/vpc/docs/configure-private-google-access#gcloud
    gcloud compute networks subnets update $SUBNETWORK_NAME \
        --region=$REGION \
        --enable-private-ip-google-access
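
    To double-check this step, you can describe the subnetwork (using the same $SUBNETWORK_NAME and $REGION variables as above) and confirm that Private Google Access is enabled:

    gcloud compute networks subnets describe $SUBNETWORK_NAME \
        --region=$REGION \
        --format='value(privateIpGoogleAccess)'   # prints True when enabled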
    

    Now create a router and a NAT rule for the router:

    gcloud compute routers create $ROUTER_NAME \
        --network $NETWORK_NAME \
        --region $REGION
    
    ## Also create a static IP address for this network
    gcloud compute addresses create $ADDRESS_NAME \
      --region=$REGION
    
    gcloud compute routers nats create $ROUTER_CONFIG_NAME \
        --router-region $REGION \
        --router $ROUTER_NAME \
        --nat-all-subnet-ip-ranges \
        --nat-external-ip-pool=$ADDRESS_NAME
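
    The static address you reserved above is the IP that the Kafka cluster (or any firewall) needs to allow. One way to look it up, using the same $ADDRESS_NAME and $REGION as before:

    gcloud compute addresses describe $ADDRESS_NAME \
        --region=$REGION \
        --format='value(address)'

    All worker traffic leaving the VPC network will appear to come from that address, so it is the one to whitelist.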
    

    This is all the networking setup you need. Now you can launch your Dataflow job.

    Launching the Dataflow job

    When launching the Dataflow job, you need to pass the subnetwork parameter and disable public IP addresses.

    In Python:

    python sample_pipeline.py \
      --runner=DataflowRunner \
      --project=$PROJECT \
      --region=$REGION \
      --subnetwork=regions/$REGION/subnetworks/$SUBNETWORK_NAME \
      --no_use_public_ips
    

    In Java:

    mvn compile exec:java -Dexec.mainClass=my.main.pipeline.Klass \
        -Dexec.args="--runner=DataflowRunner \
                     --project=$PROJECT \
                     --region=$REGION \
                     --subnetwork=regions/$REGION/subnetworks/$SUBNETWORK_NAME \
                     --usePublicIps=false" \
        -Pdataflow-runner
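
    Once the job is running, you can check that the worker VMs really have no external IP addresses. The name filter below is an assumption (Dataflow worker VM names normally start with the job name); an empty natIP column means no public address was assigned:

    gcloud compute instances list \
        --filter="name~'<your-job-name>'" \
        --format="table(name,networkInterfaces[0].networkIP,networkInterfaces[0].accessConfigs[0].natIP)"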