python, google-cloud-dlp

GCP DLP Python: how to reduce likelihood when the column name does not contain a string


I need to find a numeric client ID, so I created a custom info type:

custom_info_types = [
    {
        "info_type": {"name": "CLIENTID"},
        "regex": {"pattern": r'\d{7,8}'},
    }
]

As expected, the job returned a lot of findings, all with a VERY_LIKELY likelihood.

To reduce the findings, I'd like to use hotwords in "reverse" mode: if the column name does not contain the string "cli", then reduce the likelihood.

The documentation has examples of how to do the opposite, but since every finding already has a VERY_LIKELY likelihood, that does not help.

hotword_rule = {
    "hotword_regex": {"pattern": "(?i)(.*cli.*)(?-i)"},
    "likelihood_adjustment": {
        "fixed_likelihood": dlp_v2.Likelihood.VERY_LIKELY
    },
    "proximity": {"window_before": 1},
}

Is there a way to do what I want?

Thanks for your help!


Solution

  • To accomplish this, set the default likelihood of your custom_info_type to VERY_UNLIKELY and keep your hotword rule as-is. Every match is then reported as VERY_UNLIKELY unless the header/context matches "cli", in which case the hotword rule boosts it to VERY_LIKELY.

    Something like:

    custom_info_types = [
        {
            "info_type": {"name": "CLIENTID"},
            "regex": {"pattern": r'\d{7,8}'},
            "likelihood": "VERY_UNLIKELY"
        }
    ]
    

    When you leave the likelihood blank in the custom_info_type definition, it defaults to VERY_LIKELY, which is why every finding came back at that level.
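
Putting the two pieces together, here is a rough sketch of how the full inspect_config might look. The helper name build_inspect_config is just something I made up for illustration, and the "min_likelihood" value is my assumption about what you want to keep: as far as I know the default threshold is POSSIBLE, so VERY_UNLIKELY findings would already be dropped without it. I use enum names as strings here so the snippet has no imports; the dlp_v2.Likelihood members from your question should work just as well.

```python
def build_inspect_config():
    """Build an inspect_config dict for the CLIENTID detector (illustrative helper)."""
    # Custom detector: every 7-8 digit match starts out as VERY_UNLIKELY.
    custom_info_types = [
        {
            "info_type": {"name": "CLIENTID"},
            "regex": {"pattern": r"\d{7,8}"},
            "likelihood": "VERY_UNLIKELY",
        }
    ]

    # Hotword rule from the question, unchanged: when "cli" appears in the
    # context, the finding is set to VERY_LIKELY.
    hotword_rule = {
        "hotword_regex": {"pattern": "(?i)(.*cli.*)(?-i)"},
        "likelihood_adjustment": {"fixed_likelihood": "VERY_LIKELY"},
        "proximity": {"window_before": 1},
    }

    # Attach the rule to the custom info type only.
    rule_set = [
        {
            "info_types": [{"name": "CLIENTID"}],
            "rules": [{"hotword_rule": hotword_rule}],
        }
    ]

    return {
        "custom_info_types": custom_info_types,
        "rule_set": rule_set,
        # Assumption: you only want the boosted findings reported.
        "min_likelihood": "LIKELY",
    }


inspect_config = build_inspect_config()
```

You would then pass inspect_config to your job or inspect request the same way you do today; nothing else in the request needs to change.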

    Let me know if this works.