Different results from DLP API depending on whether input is sent as one string or as a collection of substrings

I'm seeing a behavior in the Google DLP library that puzzles me, and I'm hoping for some clarification. I'm using the Java wrapper library, google-cloud-dlp version 0.34.0-beta. Given the input:

Collection<String> input = Lists.newArrayList("Jenny Tutone  2665 Agua Vista Dr Los Gatos CA 95030 (408) 867-5309 or 408.867.5309x100");

I'm seeing the output:

███  █ ████ or █

If I pass in the same string as a collection of substrings:

Collection<String> input = Lists.newArrayList("Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos", "CA 95030", "(408) 867-5309", "or", "408.867.5309x100");

I see very different results:

███, 2665 █, █ Gatos, █ 95030, █, or, █

I'm using every InfoType I could find, 67 in total. Am I doing something wrong here? This is the core of the code that invokes the Google DLP library:

private Collection<String> redactContent(Collection<String> input,
                                String replacement,
                                Likelihood minLikelihood,
                                List<InfoType> infoTypes) {
    // Replace select info types with chosen replacement string
    final Collection<RedactContentRequest.ReplaceConfig> replaceConfigs = infoTypes.stream()
            .map(it -> RedactContentRequest.ReplaceConfig.newBuilder().setInfoType(it).setReplaceWith(replacement).build())
            .collect(Collectors.toCollection(LinkedList::new));

    final InspectConfig inspectConfig =
            InspectConfig.newBuilder()
                    .addAllInfoTypes(infoTypes)
                    .setMinLikelihood(minLikelihood)
                    .build();

    long itemCount = 0;

    try (DlpServiceClient dlpClient = DlpServiceClient.create(settings)) {
        // Google's DLP library is limited to 100 items per request, so the requests need to be chunked if the
        // number of input items is greater.

        Stream.Builder<Stream<ContentItem>> streamBuilder = Stream.builder();

        for (long processed = 0; processed < input.size(); processed += maxItemsPerRequest) {
            Collection<ContentItem> items =
                    input.stream()
                            .skip(processed)
                            .limit(maxItemsPerRequest)
                            .filter(item -> item != null && !item.isEmpty())
                            .map(item ->
                                    ContentItem.newBuilder()
                                            .setType(MediaType.PLAIN_TEXT_UTF_8.toString())
                                            .setData(ByteString.copyFrom(item.getBytes(java.nio.charset.StandardCharsets.UTF_8)))
                                            .build()
                            )
                            .collect(Collectors.toCollection(LinkedList::new));
            RedactContentRequest request = RedactContentRequest.newBuilder()
                    .setInspectConfig(inspectConfig)
                    .addAllItems(Collections.unmodifiableCollection(items))
                    .addAllReplaceConfigs(replaceConfigs)
                    .build();

            RedactContentResponse contentResponse = dlpClient.redactContent(request);
            itemCount += contentResponse.getItemsCount();
            streamBuilder.add(contentResponse.getItemsList().stream());
        }

        return streamBuilder.build()
                        .flatMap(stream -> stream.map(item -> item.getData().toStringUtf8()))
                        .collect(Collectors.toCollection(LinkedList::new));
    }
}

Solution

Context can influence findings. In an address, for example, one part can affect how another part is classified: "Mountain View CA 94043" may match as a LOCATION, while "94043" by itself may not. When running this analysis, we don't cross cell boundaries when deciding on context, so in your second ArrayList example each string is inspected individually, in its own context.
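If the fragments really belong to one record, one possible workaround is to join them back into a single string before building the request, so that the whole record is inspected as one context. The sketch below uses only the JDK; `ContextJoiner` and `joinFragments` are hypothetical names for illustration and are not part of google-cloud-dlp. It assumes a single space is an acceptable separator for your data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of google-cloud-dlp): merges the fragments of one
// record into a single string so they can be sent as one item and share context.
class ContextJoiner {

    // Returns a list holding at most one string: the non-empty fragments joined
    // by the separator. Null and empty fragments are skipped, mirroring the
    // filter in the question's redactContent method.
    static List<String> joinFragments(List<String> fragments, String separator) {
        List<String> kept = new ArrayList<>();
        for (String fragment : fragments) {
            if (fragment != null && !fragment.isEmpty()) {
                kept.add(fragment);
            }
        }
        List<String> result = new ArrayList<>();
        if (!kept.isEmpty()) {
            result.add(String.join(separator, kept));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
                "Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos", "CA 95030",
                "(408) 867-5309", "or", "408.867.5309x100");
        // The single joined string is what would be passed on as one item.
        System.out.println(joinFragments(fragments, " "));
    }
}
```

The resulting one-element list can be passed to the question's `redactContent` method in place of the fragment list. The trade-off is that the redacted output comes back as one string as well, so it cannot be mapped back to the original fragments without extra bookkeeping.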

Note: I am the PM for the DLP API.