python azure pii redaction text-analytics-api

How to redact Indian Aadhaar numbers, Brazilian CPF numbers, or other such PII using the Azure Language service

I have been trying to redact PII in documents using Azure Text Analytics. It redacts many kinds of PII, such as names, addresses, contact numbers, and SSNs. However, it does not redact Indian Aadhaar identification numbers, Brazilian CPF numbers, and many other kinds of PII. Even when it does redact them, it misidentifies them as other entities, such as phone numbers.

The official Azure documentation says that the Indian Aadhaar number is a recognized entity and can be redacted by setting the "piiCategories" parameter, but I am unable to find this parameter.

Solution

According to the Microsoft documentation, the Brazilian CPF number is recognized for the pt-pt and pt-br document languages.

The code below redacts both the Aadhaar number and the Brazilian CPF number, although, as the output shows, the Aadhaar number is categorized as PhoneNumber in this run.

Code:

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential


def authenticate_client():
    language_endpoint="https://xxxx.cognitiveservices.azure.com/"
    language_key="72xxxxxxx873476"
    ta_credential = AzureKeyCredential(language_key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=language_endpoint, 
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

def pii_recognition_example(client):
    documents = [
        "My Aadhaar number is 1234-5678-9101 and my CPF is 123.456.789-09"
    ]
    response = client.recognize_pii_entities(documents, language="pt-pt")
    result = [doc for doc in response if not doc.is_error]
    for doc in result:
        print("Redacted Text: {}".format(doc.redacted_text))
        for entity in doc.entities:
            print("Entity: {}".format(entity.text))
            print("\tCategory: {}".format(entity.category))
            print("\tConfidence Score: {}".format(entity.confidence_score))
            print("\tOffset: {}".format(entity.offset))
            print("\tLength: {}".format(entity.length))
pii_recognition_example(client)

Output:

Redacted Text: My Aadhaar number is ************** and my CPF is **************
Entity: 1234-5678-9101
        Category: PhoneNumber
        Confidence Score: 0.8
        Offset: 21
        Length: 14
Entity: 123.456.789-09
        Category: BRCPFNumber
        Confidence Score: 0.85
        Offset: 50
        Length: 14
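
On the "piiCategories" parameter from the question: in the Python SDK, the corresponding keyword argument of recognize_pii_entities is categories_filter. Below is a minimal sketch, assuming a TextAnalyticsClient-like object is passed in. The helper name recognize_selected_pii and the WANTED list are hypothetical. Whether the filter changes how the Aadhaar number is labeled (it came back as PhoneNumber above) is something to verify against your own resource.

```python
def recognize_selected_pii(client, documents, language, categories):
    """Call recognize_pii_entities restricted to the given PII categories.

    `client` is expected to be an azure.ai.textanalytics.TextAnalyticsClient;
    any object exposing the same method also works (an assumption that keeps
    this sketch testable without a live resource).
    `categories` is a list of category names such as "BRCPFNumber".
    """
    response = client.recognize_pii_entities(
        documents,
        language=language,
        categories_filter=categories,  # SDK counterpart of REST "piiCategories"
    )
    # Keep only successful results, as the code in this answer does
    return [doc for doc in response if not doc.is_error]


# Category names taken from the output shown in this answer
WANTED = ["INUniqueIdentificationNumber", "BRCPFNumber"]
```

With the client from the code above, a call might look like recognize_selected_pii(client, documents, "en", WANTED).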

Reference: Quickstart: Detect Personally Identifying Information (PII) in text - Azure AI services | Microsoft Learn

Update:

To get both categories (INUniqueIdentificationNumber and BRCPFNumber), run the detection once with language="pt" and once with language="en", then merge the results, as in the code below:

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Authenticate the client using your key and endpoint
def authenticate_client():
    language_endpoint = "https://xxx.cognitiveservices.azure.com/"
    language_key = "727xxxx9873476"
    ta_credential = AzureKeyCredential(language_key)
    text_analytics_client = TextAnalyticsClient(endpoint=language_endpoint, credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

def pii_recognition_example(client):
    documents = [
        "My Aadhaar number is 4x39 xxxx9 9xx6 and my CPF is 123.456.789-09"
    ]
    
    response_pt = client.recognize_pii_entities(documents, language="pt")
    response_en = client.recognize_pii_entities(documents, language="en")
    
    # Pair results by position before filtering, so an error in one
    # response cannot misalign the two lists
    for doc_pt, doc_en in zip(response_pt, response_en):
        if doc_pt.is_error or doc_en.is_error:
            continue
        # Prioritize redaction from both responses by merging
        combined_redacted_text = doc_pt.redacted_text  # Start with PT redacted text
        # Since we want both to be redacted, check and apply additional redactions from EN result
        for entity in doc_en.entities:
            if entity.text in combined_redacted_text:
                combined_redacted_text = combined_redacted_text.replace(entity.text, '*' * len(entity.text))

        print("Redacted Text (Combined): {}".format(combined_redacted_text))
        
        # Collect and display categories from both responses
        categories = set()
        for entity in doc_pt.entities:
            categories.add(entity.category)
        for entity in doc_en.entities:
            categories.add(entity.category)
        
        for category in categories:
            print("\tCategory: {}".format(category))

pii_recognition_example(client)

Output:

  Redacted Text (Combined): My Aadhaar number is ************** and my CPF is **************
        Category: BRCPFNumber
        Category: INUniqueIdentificationNumber
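
The merge above relies on str.replace, which rewrites every occurrence of an entity's text and cannot match text that the first pass has already partly masked. An alternative is to mask the original string using the offset and length that each entity reports. Below is a minimal sketch, assuming the SDK's default string index type (Unicode code points), which lines up with Python string indexing. The helper names mask_spans and spans_from_entities are hypothetical.

```python
def mask_spans(text, spans, mask_char="*"):
    """Return `text` with every (offset, length) span replaced by mask_char.

    Overlapping or duplicate spans are harmless: a character is simply
    masked once. Spans reaching past the end of the text are clipped.
    """
    chars = list(text)
    for offset, length in spans:
        start = max(offset, 0)
        end = min(offset + length, len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


def spans_from_entities(*entity_lists):
    """Collect (offset, length) pairs from any number of entity lists."""
    return [(e.offset, e.length) for entities in entity_lists for e in entities]
```

Inside the loop above, this could be used as mask_spans(documents[0], spans_from_entities(doc_pt.entities, doc_en.entities)), applied to the original document text rather than to either redacted_text.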