Data Loss Prevention finds superfluous entities when masking email


I am calling the DLP API to mask person names and email addresses in text, using the following request:

Request

{
  "item": {
    "value": "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com"
  },
  "deidentifyConfig": {
    "infoTypeTransformations": {
      "transformations": [
        {
          "infoTypes": [ { "name": "EMAIL_ADDRESS" } ],
          "primitiveTransformation": {
            "characterMaskConfig": {
              "maskingCharacter": "#",
              "reverseOrder": false,
              "charactersToIgnore": [
                {
                  "charactersToSkip": ".@"
                }
              ]
            }
          }
        },
        {
          "infoTypes": [ { "name": "PERSON_NAME" } ],
          "primitiveTransformation": {
            "replaceConfig": {
              "newValue": {
                "stringValue": "(person)"
              }
            }
          }
        }
      ]
    }
  },
  "inspectConfig": {
    "infoTypes": [ { "name": "EMAIL_ADDRESS" }, { "name": "PERSON_NAME" } ]
  }
}

API call

curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://dlp.googleapis.com/v2/projects/$PROJECT_ID/content:deidentify \
  -d @gcp-dlp/input/text-request.json

Response

{
  "item": {
    "value": "(person)\nPharmacist\n(person)#######.#####@#######.###(person)"
  },
  "overview": {
    "transformedBytes": "50",
    "transformationSummaries": [
      {
        "infoType": {
          "name": "EMAIL_ADDRESS"
        },
        "transformation": {
          "characterMaskConfig": {
            "maskingCharacter": "#",
            "charactersToIgnore": [
              {
                "charactersToSkip": ".@"
              }
            ]
          }
        },
        "results": [
          {
            "count": "1",
            "code": "SUCCESS"
          }
        ],
        "transformedBytes": "25"
      },
      {
        "infoType": {
          "name": "PERSON_NAME"
        },
        "transformation": {
          "replaceConfig": {
            "newValue": {
              "stringValue": "(person)"
            }
          }
        },
        "results": [
          {
            "count": "3",
            "code": "SUCCESS"
          }
        ],
        "transformedBytes": "25"
      }
    ]
  }
}

Request (text only)

Eleanor Rigby
Pharmacist
eleanor.rigby@example.com

Response (text only)

(person)
Pharmacist
(person)#######.#####@#######.###(person)

The input text contains a person name and an email address. Both are detected and masked as expected. However, additional (person) tags are added before and after the masked email address.

This is a very simple example, but I observed this behavior in every document I processed this way.

Why is the person entity detected multiple times?


Solution

  • This issue was reported on the Google Public Issue Tracker. Such reports aren't indexed, but the tracker is a good place to report issues or request new features. Follow that case for updates.

    There's a workaround suggested by Google:

    This is a case where we have some undefined behavior when findings overlap. The "(person)" text comes from the user's configuration, which replaces person names with "(person)".

    They can omit the overlaps.
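To make "omitting the overlaps" concrete, here is a minimal local sketch of the idea. It is not how Cloud DLP is implemented: the `Finding` type, the `omit_overlaps` helper, the finding offsets, and the sample address are all assumptions chosen for illustration. It only shows that dropping every PERSON_NAME span that overlaps an EMAIL_ADDRESS span leaves a single finding for the email.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A detected span; `end` is exclusive. Hypothetical type, not a DLP class."""
    info_type: str
    start: int
    end: int


def overlaps(a: Finding, b: Finding) -> bool:
    """True when the two half-open spans share at least one character."""
    return a.start < b.end and b.start < a.end


def omit_overlaps(findings, drop="PERSON_NAME", keep="EMAIL_ADDRESS"):
    """Drop every `drop` finding that overlaps any `keep` finding."""
    kept = [f for f in findings if f.info_type == keep]
    return [
        f for f in findings
        if f.info_type != drop or not any(overlaps(f, k) for k in kept)
    ]


# Assumed sample text and assumed findings (the offsets are illustrative).
text = "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com"
email_start = text.index("eleanor.rigby@")
findings = [
    Finding("PERSON_NAME", 0, 13),                              # "Eleanor Rigby"
    Finding("PERSON_NAME", email_start, email_start + 7),       # "eleanor"
    Finding("PERSON_NAME", email_start + 8, email_start + 13),  # "rigby"
    Finding("EMAIL_ADDRESS", email_start, len(text)),           # whole address
]

filtered = omit_overlaps(findings)
for f in filtered:
    print(f.info_type, repr(text[f.start:f.end]))
```

With these assumed findings, only the standalone name and the email address survive, so each stretch of text is transformed by exactly one rule.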

    For more information, see the section "Omit matches on PERSON_NAME detector if also matched by EMAIL_ADDRESS detector" of the documentation page "Modifying infoType detectors to refine scan results":

    The following JSON snippet and code in several languages illustrate how to indicate to Cloud DLP using an InspectConfig that it should only return one match in the case that matches for the PERSON_NAME detector overlap with matches for the EMAIL_ADDRESS detector. This avoids the situation where a single email address matches on both the PERSON_NAME and EMAIL_ADDRESS detectors.

    ...
        "inspectConfig":{
          "ruleSet":[
            {
              "infoTypes":[
                {
                  "name":"PERSON_NAME"
                }
              ],
              "rules":[
                {
                  "exclusionRule":{
                    "excludeInfoTypes":{
                      "infoTypes":[
                        {
                          "name":"EMAIL_ADDRESS"
                        }
                      ]
                    },
                    "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
                  }
                }
              ]
            }
          ]
        } 
    ...
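The snippet above shows only the `inspectConfig` part. As a hedged sketch, the following Python builds the full request body from the question with the exclusion rule merged in, so it can be saved and sent with the same curl call. The function name `build_deidentify_request` and the sample text are assumptions; the field names are taken from the request and the rule set shown in this thread. The source does not confirm that this removes the extra (person) tags in every case.

```python
import json


def build_deidentify_request(text: str) -> dict:
    """Combine the question's deidentify config with the exclusion rule."""
    email = {"name": "EMAIL_ADDRESS"}
    person = {"name": "PERSON_NAME"}
    return {
        "item": {"value": text},
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        # Mask emails with '#', keeping '.' and '@' visible.
                        "infoTypes": [email],
                        "primitiveTransformation": {
                            "characterMaskConfig": {
                                "maskingCharacter": "#",
                                "reverseOrder": False,
                                "charactersToIgnore": [
                                    {"charactersToSkip": ".@"}
                                ],
                            }
                        },
                    },
                    {
                        # Replace person names with a fixed placeholder.
                        "infoTypes": [person],
                        "primitiveTransformation": {
                            "replaceConfig": {
                                "newValue": {"stringValue": "(person)"}
                            }
                        },
                    },
                ]
            }
        },
        "inspectConfig": {
            "infoTypes": [email, person],
            # Omit PERSON_NAME findings that overlap an EMAIL_ADDRESS finding.
            "ruleSet": [
                {
                    "infoTypes": [person],
                    "rules": [
                        {
                            "exclusionRule": {
                                "excludeInfoTypes": {"infoTypes": [email]},
                                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH",
                            }
                        }
                    ],
                }
            ],
        },
    }


# Assumed sample input; substitute the text you want to de-identify.
request_body = build_deidentify_request(
    "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com"
)
print(json.dumps(request_body, indent=2))
```

Redirecting the printed JSON to the request file used by the curl command keeps the rest of the workflow unchanged.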