google-cloud-platform · google-bigquery · standards-compliance · google-cloud-dlp · pii

How to run Cloud DLP (Data Loss Prevention) on all BigQuery tables in my project?


As per the DLP docs, when you create an Inspect Job, you need to specify the table reference:

{
  "inspectJob":{
    "storageConfig":{
      "bigQueryOptions":{
        "tableReference":{
          "projectId":"bigquery-public-data",
          "datasetId":"usa_names",
          "tableId":"usa_1910_current"
        },
        "rowsLimit":"1000",
        "sampleMethod":"RANDOM_START",
        "identifyingFields":[
          {
            "name":"name"
          }
        ]
      }
    },
    "inspectConfig":{
      "infoTypes":[
        {
          "name":"FIRST_NAME"
        }
      ],
      "includeQuote":true
    },
    "actions":[
      {
        "saveFindings":{
          "outputConfig":{
            "table":{
              "projectId":"[PROJECT-ID]",
              "datasetId":"testingdlp",
              "tableId":"bqsample3"
            },
            "outputSchema":"BASIC_COLUMNS"
          }
        }
      }
    ]
  }
}

This means I'd need to create one Inspect Job for each table. I want to find sensitive data in all my BigQuery resources; how can I do it?


Solution

  • To run DLP on all your BigQuery resources, you have two options.

    • Programmatically fetch your BigQuery tables, then trigger one Inspect Job for each table.

      Pros: Cheaper. Storage inspection is priced at $1.00 per gigabyte (GB) in the 1 GB to 50 terabytes (TB) tier.

      Cons: It's a batch operation, so it's not executed in real time.

      Python sample illustrating the idea:

      from google.cloud import bigquery

      # project_id is your GCP project ID.
      client = bigquery.Client()
      datasets = list(client.list_datasets(project=project_id))

      if datasets:
          for dataset in datasets:
              tables = client.list_tables(dataset.dataset_id)
              for table in tables:
                  # Create an Inspect Job for table.table_id (see the sketch below).
                  pass
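
      For the "Create an Inspect Job" step, a minimal sketch with the google-cloud-dlp Python client could look like the following. The infoTypes, rows limit, and the results dataset/table names ("dlp_results", "<table>_findings") are placeholders for illustration, not values from the original answer:

      import google.cloud.dlp_v2 as dlp_v2

      dlp_client = dlp_v2.DlpServiceClient()

      def create_inspect_job(project_id, dataset_id, table_id):
          """Create one DLP Inspect Job that scans a single BigQuery table."""
          inspect_job = {
              "storage_config": {
                  "big_query_options": {
                      "table_reference": {
                          "project_id": project_id,
                          "dataset_id": dataset_id,
                          "table_id": table_id,
                      },
                      "rows_limit": 1000,              # placeholder sampling limit
                      "sample_method": "RANDOM_START",
                  }
              },
              "inspect_config": {
                  "info_types": [{"name": "FIRST_NAME"}],  # placeholder infoTypes
                  "include_quote": True,
              },
              "actions": [
                  {
                      "save_findings": {
                          "output_config": {
                              "table": {
                                  "project_id": project_id,
                                  "dataset_id": "dlp_results",         # placeholder dataset
                                  "table_id": f"{table_id}_findings",  # placeholder table
                              },
                              "output_schema": "BASIC_COLUMNS",
                          }
                      }
                  }
              ],
          }
          return dlp_client.create_dlp_job(
              request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
          )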
      
    • Programmatically fetch your BigQuery tables, query each table, and call the DLP Streaming Content API.

      Pros: It's a real-time operation.

      Cons: More expensive. Content inspection is priced at $3.00 per gigabyte (GB) for volumes over 1 GB.

      Java sample illustrating the idea:

      // Requires the Simba/Magnitude BigQuery JDBC driver on the classpath.
      String url =
          String.format(
              "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=%s;",
              projectId);
      com.simba.googlebigquery.jdbc42.DataSource ds =
          new com.simba.googlebigquery.jdbc42.DataSource();
      ds.setURL(url);
      Connection conn = ds.getConnection();
      DatabaseMetaData databaseMetadata = conn.getMetaData();
      // List every table in the project via JDBC metadata.
      ResultSet tablesResultSet =
          databaseMetadata.getTables(conn.getCatalog(), null, "%", new String[]{"TABLE"});
      while (tablesResultSet.next()) {
          // Query the table's data and call the DLP Streaming Content API
          // (see the sketch below).
      }
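
      The streaming call itself is DLP's content.inspect method. For brevity, here is a minimal Python sketch of that call (the variables project_id, column_names, and rows are assumed to hold your project ID, the queried column names, and the queried row values; the infoType is a placeholder):

      import google.cloud.dlp_v2 as dlp_v2

      dlp_client = dlp_v2.DlpServiceClient()

      # Wrap the queried rows in a DLP table item; every value is sent as a string.
      item = {
          "table": {
              "headers": [{"name": col} for col in column_names],
              "rows": [
                  {"values": [{"string_value": str(cell)} for cell in row]}
                  for row in rows
              ],
          }
      }

      response = dlp_client.inspect_content(
          request={
              "parent": f"projects/{project_id}",
              "inspect_config": {
                  "info_types": [{"name": "FIRST_NAME"}],  # placeholder infoType
                  "include_quote": True,
              },
              "item": item,
          }
      )

      # Each finding reports the infoType that matched and (optionally) the quoted value.
      for finding in response.result.findings:
          print(finding.info_type.name, finding.quote)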
      

    For a full tutorial on the second option, there's a blog post that walks through it.

    Beware: "it is possible for costs to become very high, depending on the quantity of information that you instruct the Cloud DLP to scan. To learn several methods that you can use to keep costs down while also ensuring that you're using the Cloud DLP to scan the exact data that you intend to, see Keeping Cloud DLP costs under control."

    The billing information was up to date at the time of writing; for the latest figures, check the DLP billing docs page.