Tags: python, google-cloud-platform, google-cloud-dlp

What arguments need to be passed to the deidentify_with_fpe() Python API wrapper for Google DLP?


I am working through the Google Cloud DLP API documentation available here; this question is specifically about deidentify_with_fpe().

My question is: what format do the arguments passed to the function need to have for it to return anonymised data? My code at the moment is:

def deidentify_with_fpe(
    string,
    info_types,
    alphabet=1,
    project='XXXX-data-development',
    surrogate_type=None,
    key_name='projects/XXXX-data-development/locations/global/keyRings/google-dlp-test-global/cryptoKeys/google-dlp-test-key-global',
    wrapped_key=WRAPPED
):
    """Uses the Data Loss Prevention API to deidentify sensitive data in a
    string using Format Preserving Encryption (FPE).
    Args:
        string: The string to deidentify (will be treated as text).
        info_types: A list of info type names to look for, for example
            ['FIRST_NAME'].
        project: The Google Cloud project id to use as a parent resource.
        alphabet: The set of characters to replace sensitive ones with. For
            more information, see https://cloud.google.com/dlp/docs/reference/
            rest/v2beta2/organizations.deidentifyTemplates#ffxcommonnativealphabet
        surrogate_type: The name of the surrogate custom info type to use. Only
            necessary if you want to reverse the deidentification process. Can
            be essentially any arbitrary string, as long as it doesn't appear
            in your dataset otherwise.
        key_name: The name of the Cloud KMS key used to encrypt ('wrap') the
            AES-256 key. Example:
            key_name = 'projects/YOUR_GCLOUD_PROJECT/locations/YOUR_LOCATION/
            keyRings/YOUR_KEYRING_NAME/cryptoKeys/YOUR_KEY_NAME'
        wrapped_key: The encrypted ('wrapped') AES-256 key to use. This key
            should be encrypted using the Cloud KMS key specified by key_name.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library
    import google.cloud.dlp_v2

    # Instantiate a client from the service account key file
    dlp = google.cloud.dlp_v2.DlpServiceClient.from_service_account_json(
        '/Users/callumsmyth/virtual_envs/google_dlp_test/XXXX.json'
    )
    
    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # WRAPPED is read from the ciphertext file in binary mode, so it is
    # already a bytes object. Only base64-decode the key if it was stored
    # as base64 text:
    # import base64
    # wrapped_key = base64.b64decode(wrapped_key)

    # Construct FPE configuration dictionary
    crypto_replace_ffx_fpe_config = {
        "crypto_key": {
            "kms_wrapped": {
                "wrapped_key": wrapped_key,
                "crypto_key_name": key_name,
            }
        },
        "common_alphabet": alphabet,
    }

    # Add surrogate type
    if surrogate_type:
        crypto_replace_ffx_fpe_config["surrogate_info_type"] = {
            "name": surrogate_type
        }

    # Construct inspect configuration dictionary
    inspect_config = {
        "info_types": [{"name": info_type} for info_type in info_types]
    }

    # Construct deidentify configuration dictionary
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "crypto_replace_ffx_fpe_config": crypto_replace_ffx_fpe_config
                    }
                }
            ]
        }
    }

    # Convert string to item
    item = {"value": string}

    # Call the API
    response = dlp.deidentify_content(
        parent,
        inspect_config=inspect_config,
        deidentify_config=deidentify_config,
        item=item,
    )

    # Print results
    print(response.item.value)

Where

with open('mysecret.txt.encrypted', 'rb') as f:
    WRAPPED = f.read()

and mysecret.txt.encrypted was generated by this command in the terminal:

--keyring google-dlp-test-global --key google-dlp-test-key-global \
--plaintext-file google-token.txt \
--ciphertext-file mysecret.txt.encrypted

where google-token.txt was generated from here.

The error I am getting when calling deidentify_with_fpe('My name is john smith', ['FIRST_NAME']) is as follows:

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered."
    debug_error_string = "{"created":"@1581675678.839972000","description":"Error received from peer ipv4:216.58.213.10:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.","grpc_status":3}"

which is a direct cause of:

InvalidArgument: 400 Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.

So I think my issue is to do with the key, before it is encrypted. I cannot see anywhere in the documentation how to source that key or how to pass it into the function.

I appreciate this is a lengthy submission, and any response would be appreciated. I've spent too long trying to do this and feel like I'm close to getting it to work.


Solution

  • The error: “google.api_core.exceptions.InvalidArgument: 400 Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.”

    This is a generic error raised when free-form text de-identification fails because of transformation errors. Unfortunately, the Python library does not seem to expose the error details.
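
    One way to check what the exception object actually carries is to dump its optional fields. This is only a minimal sketch: summarize_api_error is a hypothetical helper, and the attribute names it probes (errors, details, response) are assumptions about what the installed google-api-core version exposes, so it reads them defensively with getattr and works on any exception.

```python
def summarize_api_error(exc):
    """Collect whatever detail fields the exception object happens to carry."""
    summary = {"type": type(exc).__name__, "message": str(exc)}
    # These attribute names are assumptions; missing or empty ones are skipped.
    for attr in ("errors", "details", "response"):
        value = getattr(exc, attr, None)
        if value:
            summary[attr] = value
    return summary


def call_and_report(func, *args, **kwargs):
    """Run func and print a summary of any exception before re-raising it."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        print(summarize_api_error(exc))
        raise
```

    For example, call_and_report(deidentify_with_fpe, 'My name is john smith', ['FIRST_NAME']) would print the summary before the original exception propagates. If none of the optional fields are populated, only the generic message is available client-side.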

    As per the service documentation [1], the detected tokens must be at least two characters long:

    The input value:
    
    - Must be at least two characters long (or the empty string).
    - Must be encoded as ASCII.
    - Comprised of the characters specified by an "alphabet," which is the set of between 2 and 64 allowed characters in the input value. For more information, see the alphabet field in CryptoReplaceFfxFpeConfig.
    
    
    [1] https://cloud.google.com/dlp/docs/transformations-reference#fpe
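
The constraints quoted above can be checked locally before calling the API. The following is a rough sketch, not part of the DLP client library: fpe_input_problems and the ALPHABETS table are hypothetical names, and the character sets are my reading of the documented common alphabets (NUMERIC, HEXADECIMAL, UPPER_CASE_ALPHA_NUMERIC, ALPHA_NUMERIC). Note that the question passes alphabet=1; if that value maps to NUMERIC, as I believe it does in the enum, a detected name such as "john" would fall outside the alphabet.

```python
import string

# Assumed character sets for the common alphabets (illustrative only).
ALPHABETS = {
    "NUMERIC": string.digits,
    "HEXADECIMAL": string.digits + "ABCDEF",
    "UPPER_CASE_ALPHA_NUMERIC": string.digits + string.ascii_uppercase,
    "ALPHA_NUMERIC": string.digits + string.ascii_letters,
}


def fpe_input_problems(value, alphabet_name):
    """Return a list of reasons why value may be rejected by FPE."""
    allowed = set(ALPHABETS[alphabet_name])
    problems = []
    # The value must be at least two characters long, or the empty string.
    if len(value) == 1:
        problems.append("must be at least two characters long (or empty)")
    # The value must be encoded as ASCII.
    if any(ord(ch) > 127 for ch in value):
        problems.append("must be encoded as ASCII")
    # Every character must belong to the chosen alphabet.
    outside = sorted(set(value) - allowed)
    if outside:
        problems.append(
            "contains characters outside the %s alphabet: %s"
            % (alphabet_name, outside)
        )
    return problems


if __name__ == "__main__":
    for token, alphabet in [("john", "NUMERIC"), ("john", "ALPHA_NUMERIC")]:
        print(token, alphabet, fpe_input_problems(token, alphabet))
```

An empty list means the token passes these three checks; it does not guarantee that the API call succeeds, since key or permission problems would surface separately.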