google-cloud-platform, google-cloud-dlp

Does GCP Data Loss Prevention support publishing its results to Data Catalog for external BigQuery tables?


I was trying to auto-tag InfoTypes like PhoneNumber and EmailId on data in a GCS bucket and in BigQuery external tables using the Data Loss Prevention tool in GCP, so that I could have those tags in Data Catalog and subsequently in Dataplex. The problems are:

  1. If I select any source other than a BigQuery table (GCS, Datastore, etc.), the option to publish GCP DLP inspection results to Data Catalog is disabled.
  2. If I select a BigQuery table, the Data Catalog publish option is enabled, but when I try to run the inspection job it errors out saying, "External tables are not supported for inspection". Surprisingly, it supports only internal (native) BigQuery tables.

The question is: is my understanding correct that the GCP DLP - Data Catalog integration works only for internal BigQuery tables? Am I doing something wrong here? The GCP documentation doesn't mention these things either!

Also, while configuring the inspection job from the DLP UI console, I had to provide a BigQuery table ID mandatorily. Is there a way I can run a DLP inspection job against a whole BQ dataset or a set of tables?


Solution

  • Regarding Data Loss Prevention services in Google Cloud, your understanding is correct: data cannot be exfiltrated by copying it to services outside the perimeter, e.g., a public Google Cloud Storage (GCS) bucket or an external BigQuery table. Visit this URL for more reference.

    Now, about how to run a DLP inspection job against a set of BQ tables, there are two ways to do it:

    • Programmatically fetch the BigQuery tables, query each table, and call the DLP streaming content API. It operates in real time, but it is expensive. Here I share the concept in Java example code (a sketch of the inspection call itself follows the tutorial note below):
    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;

    // Build the JDBC URL for the Simba BigQuery driver (OAuthType=3 uses
    // application default credentials).
    String url =
        String.format(
            "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=%s;",
            projectId);
    com.simba.googlebigquery.jdbc42.DataSource ds =
        new com.simba.googlebigquery.jdbc42.DataSource();
    ds.setURL(url);
    Connection conn = ds.getConnection();

    // Enumerate every table visible through the connection's metadata.
    DatabaseMetaData databaseMetadata = conn.getMetaData();
    ResultSet tablesResultSet =
        databaseMetadata.getTables(conn.getCatalog(), null, "%", new String[]{"TABLE"});
    while (tablesResultSet.next()) {
      // Query the table's data and call the DLP streaming content API.
    }
    

    Here is a tutorial for this method.
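    For the inspection call that the loop's comment refers to, here is a minimal sketch using the DLP Python client. The project_id variable, the sample text, and the chosen info types are placeholders taken from the question, not part of the original answer:

    from google.cloud import dlp_v2

    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}"  # project_id is a placeholder

    inspect_config = {
        "info_types": [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
        "include_quote": True,
    }
    # item would normally hold the row data fetched from the BigQuery query above.
    item = {"value": "Call me at 555-0100 or mail jane@example.com"}

    response = dlp.inspect_content(
        request={"parent": parent, "inspect_config": inspect_config, "item": item}
    )
    for finding in response.result.findings:
        print(finding.info_type.name, finding.quote)

    Each streamed chunk of table data would be passed as the item, which is why this approach gets expensive on large tables.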

    • Programmatically fetch the BigQuery tables, and then trigger one inspect job for each table. It is the cheaper method, but keep in mind that it is a batch operation, so it doesn't execute in real time. Here is the concept in a Python example (a sketch of the per-table job creation follows the snippet):
    from google.cloud import bigquery

    client = bigquery.Client()
    datasets = list(client.list_datasets(project=project_id))

    if datasets:
        for dataset in datasets:
            tables = client.list_tables(dataset.dataset_id)
            for table in tables:
                # Create an inspect job for table.table_id here.
                pass
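    As a rough sketch of what each per-table job could look like with the DLP Python client: the create_inspect_job helper name and the info types below are illustrative, and the Data Catalog publish action only works for native BigQuery tables, per the limitation discussed above:

    from google.cloud import dlp_v2

    def create_inspect_job(dlp, project_id, dataset_id, table_id):
        # Inspect one native BigQuery table and publish the findings summary
        # to Data Catalog (not supported for external tables).
        inspect_job = {
            "storage_config": {
                "big_query_options": {
                    "table_reference": {
                        "project_id": project_id,
                        "dataset_id": dataset_id,
                        "table_id": table_id,
                    }
                }
            },
            "inspect_config": {
                "info_types": [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
            },
            "actions": [{"publish_findings_to_cloud_data_catalog": {}}],
        }
        return dlp.create_dlp_job(
            request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
        )

    A call like create_inspect_job(dlp_v2.DlpServiceClient(), project_id, dataset.dataset_id, table.table_id) would replace the placeholder comment in the loop above.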

    Use this thread for more reference on running a DLP inspection job against a set of BQ tables.