I am calling the DLP API to mask person names and email addresses in text, using the following request:
Request
{
"item": {
"value": "Eleanor Rigby\nPharmacist\[email protected]"
},
"deidentifyConfig": {
"infoTypeTransformations": {
"transformations": [
{
"infoTypes": [ { "name": "EMAIL_ADDRESS" } ],
"primitiveTransformation": {
"characterMaskConfig": {
"maskingCharacter": "#",
"reverseOrder": false,
"charactersToIgnore": [
{
"charactersToSkip": ".@"
}
]
}
}
},
{
"infoTypes": [ { "name": "PERSON_NAME" } ],
"primitiveTransformation": {
"replaceConfig": {
"newValue": {
"stringValue": "(person)"
}
}
}
}
]
}
},
"inspectConfig": {
"infoTypes": [ { "name": "EMAIL_ADDRESS" }, { "name": "PERSON_NAME" } ]
}
}
API call
curl -s \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://dlp.googleapis.com/v2/projects/$PROJECT_ID/content:deidentify \
-d @gcp-dlp/input/text-request.json
Response
{
"item": {
"value": "(person)\nPharmacist\n(person)#######.#####@#######.###(person)"
},
"overview": {
"transformedBytes": "50",
"transformationSummaries": [
{
"infoType": {
"name": "EMAIL_ADDRESS"
},
"transformation": {
"characterMaskConfig": {
"maskingCharacter": "#",
"charactersToIgnore": [
{
"charactersToSkip": ".@"
}
]
}
},
"results": [
{
"count": "1",
"code": "SUCCESS"
}
],
"transformedBytes": "25"
},
{
"infoType": {
"name": "PERSON_NAME"
},
"transformation": {
"replaceConfig": {
"newValue": {
"stringValue": "(person)"
}
}
},
"results": [
{
"count": "3",
"code": "SUCCESS"
}
],
"transformedBytes": "25"
}
]
}
}
Request (text only)
Eleanor Rigby
Pharmacist
[email protected]
Response (text only)
(person)
Pharmacist
(person)#######.#####@#######.###(person)
The input text contains a person name and an email address. Both are detected and masked as expected. However, additional (person) tags are added before and after the masked email address.
This is a very simple example, but I observed this behavior in every document I processed this way.
Why is the person entity detected multiple times?
This issue has been reported on the Google Public Issue Tracker. Such reports aren't indexed, but the tracker is a good way to report issues or request new features. Please follow this case for updates.
There's a workaround suggested by Google:
"This is a case where we have some undefined behavior when findings overlap. The person comes from the user's configuration to replace people name with person. They can omit the overlaps."
For more information, see the section "Omit matches on PERSON_NAME detector if also matched by EMAIL_ADDRESS detector" on the documentation page "Modifying infoType detectors to refine scan results":
The following JSON snippet and code in several languages illustrate how to indicate to Cloud DLP using an InspectConfig that it should only return one match in the case that matches for the PERSON_NAME detector overlap with matches for the EMAIL_ADDRESS detector. Doing this is to avoid the situation where an email address such as "[email protected]" matches on both the PERSON_NAME and EMAIL_ADDRESS detectors.
...
"inspectConfig":{
  "ruleSet":[
    {
      "infoTypes":[
        { "name":"PERSON_NAME" }
      ],
      "rules":[
        {
          "exclusionRule":{
            "excludeInfoTypes":{
              "infoTypes":[
                { "name":"EMAIL_ADDRESS" }
              ]
            },
            "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
          }
        }
      ]
    }
  ]
}
...
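Applied to the request from the question, the exclusion rule would be merged into the existing inspectConfig roughly as sketched below. This is a minimal sketch that combines the question's infoTypes list with the documented ruleSet. The item and deidentifyConfig members are assumed to stay exactly as in the original request and are omitted here. I have not run this against the API, so whether it removes both extra (person) tags for this exact input is an assumption.

```json
{
  "inspectConfig": {
    "infoTypes": [ { "name": "EMAIL_ADDRESS" }, { "name": "PERSON_NAME" } ],
    "ruleSet": [
      {
        "infoTypes": [ { "name": "PERSON_NAME" } ],
        "rules": [
          {
            "exclusionRule": {
              "excludeInfoTypes": {
                "infoTypes": [ { "name": "EMAIL_ADDRESS" } ]
              },
              "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
            }
          }
        ]
      }
    ]
  }
}
```

With MATCHING_TYPE_PARTIAL_MATCH, a PERSON_NAME finding should be dropped whenever it overlaps an EMAIL_ADDRESS finding, so the standalone name on the first line would still be replaced with (person) while the email address is only masked.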