google-cloud-dlp

Obtaining the number of items de-identified?


I'm trying to report the exact number of words/strings that were de-identified, based on the de-identification result from the Google DLP Java library. I'm running this on the response:

DeidentifyContentResponse response = dlpClient.deidentifyContent(request);
// Sum up the redactions
List<TransformationSummary> summaries =
        response.getOverview().getTransformationSummariesList();
int redactionCount = 0;

if (!isEmpty(summaries)) {
    redactionCount = summaries.stream()
            .mapToInt(TransformationSummary::getResultsCount)
            .sum();
}

I'm sending the input as a Table where each input string is one row, regardless of how many words/columns are in it. The redaction count mostly matches what I expect, but in some cases it is off. For example, the input Steve Jobs yields a redactionCount of 3 with the code shown above. I'm guessing the reason is that it matches more than one InfoType. I have FIRST_NAME, LAST_NAME, and PERSON_NAME in my list of InfoTypes, so I presumably get one match for the first name, another for the last name, and a third for the "person name" in its entirety. What I'm looking for is essentially how many words were redacted/de-identified, i.e. I would expect redactionCount to be 2 here. Is there a better/easier way of doing this?


Solution

  • You are right that, by design, the transformation summary reports the number of transformations, not the number of words transformed. What you point out here is also a bug, which I've filed with the team.

    For some transparency and detail on the bug: overlapping findings are not being handled correctly. We can fix that, and in the meantime, if you drop PERSON_NAME from your request, you'll get the behavior you were seeking.

    (Of note: even with the overlap bug fixed, if you asked for PERSON_NAME and not the other two, you would end up with a single transformation.) It is also possible for a first name to span multiple words, so the summary will not always give you a word count.
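
    If you really need a word count rather than a transformation count, one option is to compute it on the client. The following is only a rough sketch and not part of the DLP library: it assumes you already have the character offsets of each finding (for example, collected from a separate inspect request), and RedactedWordCounter, Span, mergeOverlaps, and countRedactedWords are hypothetical names made up for illustration. It merges overlapping findings first, so FIRST_NAME, LAST_NAME, and PERSON_NAME hits on the same text are counted once, and then counts whitespace-separated words inside the merged ranges.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the DLP client library.
final class RedactedWordCounter {

    /** Half-open character range [start, end) covered by one finding. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Merges overlapping or touching spans so each character is counted once. */
    static List<Span> mergeOverlaps(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));

        List<Span> merged = new ArrayList<>();
        for (Span s : sorted) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0 && s.start <= merged.get(lastIndex).end) {
                // Overlaps (or touches) the previous span: extend it.
                Span last = merged.remove(lastIndex);
                merged.add(new Span(last.start, Math.max(last.end, s.end)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }

    /** Counts whitespace-separated words that fall inside at least one finding. */
    static int countRedactedWords(String text, List<Span> findings) {
        int words = 0;
        for (Span s : mergeOverlaps(findings)) {
            String covered = text.substring(s.start, s.end).trim();
            if (!covered.isEmpty()) {
                words += covered.split("\\s+").length;
            }
        }
        return words;
    }
}
```

    For the input Steve Jobs with findings at [0, 5), [6, 10), and [0, 10), the three spans merge into one range covering both words, so countRedactedWords returns 2. Note that a finding covering only part of a word still counts as one word in this sketch.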