I want to run a pipeline (a previously staged template) in Google Cloud Dataflow, triggered from a Google Cloud Function via the GAPI JS library (as described in https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/). How can I limit the resources this pipeline has access to? For example, I don't want it to be able to write to every Pub/Sub topic or every bucket under the project. I don't even want the pipeline to be able to make an HTTP request at all.
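For context, here is a minimal sketch of the kind of launch call I'm making from the Cloud Function (project ID, bucket, template path, and parameter names are placeholders, not my real setup):

```js
// Cloud Function that launches a staged Dataflow template.
// Assumes the googleapis Node.js client; all names below are placeholders.
const { google } = require('googleapis');

exports.launchPipeline = async (req, res) => {
  // Uses the Application Default Credentials of the function's runtime account.
  const auth = await google.auth.getClient({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const dataflow = google.dataflow({ version: 'v1b3', auth });

  const result = await dataflow.projects.templates.launch({
    projectId: 'my-project',                          // placeholder
    gcsPath: 'gs://my-bucket/templates/my-template',  // staged template
    requestBody: {
      jobName: 'etl-job-' + Date.now(),
      parameters: { input: 'gs://my-bucket/input/*' }, // template parameters
    },
  });

  res.status(200).json(result.data);
};
```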
From what I read at https://cloud.google.com/dataflow/security-and-permissions, I can only do that when running the pipeline from a local machine, because then the access rights are determined by my own user's permissions. When run in the cloud, however, the pipeline runs under the Cloud Services Account and the Compute Engine Service Account, and I cannot restrict those without breaking things elsewhere in the project... Correct?
The reason I want this is that I am building a multitenant system that will use Dataflow to ETL customers' data before it becomes available for querying. The pipelines will be authored (tailored to each customer's data shapes) by data engineers/consultants, and they can make mistakes; the code must, in principle, be treated as untrustworthy by default.
So how do I limit what a pipeline can and cannot do without executing it from a local machine? Completely separate projects per customer? One severely restricted project, with buckets and other resources assigned one by one via cross-project access grants? Or do I "simulate" local execution by setting up a micro instance with the gcloud utility installed and running pipelines from there under separate users?
Would using Dataproc instead (and accepting the cost of a lower level of abstraction and more DevOps work) help?
First of all, the user code running in Dataflow's worker VMs bears the Compute Engine Service Account's credentials by default, regardless of who launched the job or where it was launched from.
So basically your question can be reinterpreted as: how do I restrict what the service account behind the workers can access, without loosening or breaking permissions used elsewhere in the project?

Two high-level solutions here:
A: Put every customer's pipelines into a different project.
B: Apply restrictions on a single pipeline, without creating a new project.
In both cases, pipelines can be launched with the --network option, which gives you the flexibility to configure the network the worker VMs attach to (for example, a VPC whose firewall rules block outbound traffic, addressing your "no HTTP requests" requirement).
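As a sketch of what that looks like when launching a template through the same JS client: the launch request's `environment` (a `RuntimeEnvironment`) accepts `network`, `subnetwork`, and `serviceAccountEmail` fields, the last of which lets the workers run as a restricted account instead of the Compute Engine default. The project, network, and account names below are hypothetical:

```js
// Launch environment restricting where, and as whom, the workers run.
// Network, subnetwork, and service account names are placeholders.
const launchRequest = {
  projectId: 'tenant-a-project', // hypothetical per-tenant project
  gcsPath: 'gs://my-bucket/templates/my-template',
  requestBody: {
    jobName: 'etl-job-tenant-a',
    parameters: { input: 'gs://tenant-a-bucket/input/*' },
    environment: {
      // Attach workers to a locked-down VPC (e.g. egress blocked by firewall).
      network: 'restricted-vpc',
      subnetwork: 'regions/us-central1/subnetworks/restricted-subnet',
      // Run workers as a least-privilege account instead of the
      // Compute Engine default service account.
      serviceAccountEmail: 'df-worker@tenant-a-project.iam.gserviceaccount.com',
    },
  },
};

// Then: await dataflow.projects.templates.launch(launchRequest);
```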
Solution A is the better fit because you are building a multitenant service, where isolation between customers is likely very important, and it is also easier to configure correctly.
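If you go with per-tenant projects and still need to share a specific resource across projects, one way is a read-modify-write of that resource's IAM policy, granting the tenant's worker service account access to a single bucket rather than broad project-level rights. A sketch with hypothetical bucket and account names:

```js
// Grant one tenant project's worker service account access to a single
// shared bucket, instead of project-wide permissions.
// Bucket name, role, and service account are placeholders.
const { google } = require('googleapis');

async function grantBucketAccess() {
  const auth = await google.auth.getClient({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const storage = google.storage({ version: 'v1', auth });

  // Read the current bucket-level IAM policy...
  const { data: policy } = await storage.buckets.getIamPolicy({
    bucket: 'shared-staging-bucket',
  });

  // ...add a binding scoped to this one bucket...
  policy.bindings = policy.bindings || [];
  policy.bindings.push({
    role: 'roles/storage.objectAdmin',
    members: ['serviceAccount:df-worker@tenant-a-project.iam.gserviceaccount.com'],
  });

  // ...and write it back.
  await storage.buckets.setIamPolicy({
    bucket: 'shared-staging-bucket',
    requestBody: policy,
  });
}
```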