How to redact Indian Aadhaar numbers, Brazilian CPF numbers, or other such PII using the Azure Language service


I have been trying to redact PII in documents using Azure Text Analytics. It redacts many kinds of PII, such as names, addresses, contact numbers, and SSNs. However, it does not redact the Indian Aadhaar identification number, the Brazilian CPF number, and several other kinds of PII. When it does redact them, it misidentifies them as other entities, such as phone numbers.

The official Azure documentation says that the Indian Aadhaar number is a recognized entity and can be redacted by setting the "piiCategories" parameter, but I am unable to find this parameter.


Solution


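    According to the Microsoft documentation, the Brazilian CPF number entity is supported for the pt-pt and pt-br document languages.

    The code below redacts both the Aadhaar number and the Brazilian CPF number. Note that with language="pt-pt" the Aadhaar number is redacted but categorized as PhoneNumber, as the output shows.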

    Code:

    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential
    
    
    def authenticate_client():
        language_endpoint="https://xxxx.cognitiveservices.azure.com/"
        language_key="72xxxxxxx873476"
        ta_credential = AzureKeyCredential(language_key)
        text_analytics_client = TextAnalyticsClient(
                endpoint=language_endpoint, 
                credential=ta_credential)
        return text_analytics_client
    
    client = authenticate_client()
    
    def pii_recognition_example(client):
        documents = [
            "My Aadhaar number is 1234-5678-9101 and my CPF is 123.456.789-09"
        ]
        # "pt-pt" selects Portuguese, the language in which the CPF number is recognized
        response = client.recognize_pii_entities(documents, language="pt-pt")
        # Skip documents the service could not process
        result = [doc for doc in response if not doc.is_error]
        for doc in result:
            # redacted_text is the input with every detected entity masked by asterisks
            print("Redacted Text: {}".format(doc.redacted_text))
            for entity in doc.entities:
                print("Entity: {}".format(entity.text))
                print("\tCategory: {}".format(entity.category))
                print("\tConfidence Score: {}".format(entity.confidence_score))
                print("\tOffset: {}".format(entity.offset))
                print("\tLength: {}".format(entity.length))

    pii_recognition_example(client)
    

    Output:

    Redacted Text: My Aadhaar number is ************** and my CPF is **************
    Entity: 1234-5678-9101
            Category: PhoneNumber
            Confidence Score: 0.8
            Offset: 21
            Length: 14
    Entity: 123.456.789-09
            Category: BRCPFNumber
            Confidence Score: 0.85
            Offset: 50
            Length: 14
    


    Reference: Quickstart: Detect Personally Identifying Information (PII) in text - Azure AI services | Microsoft Learn
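On the "piiCategories" parameter from the question: as far as I know, the Python SDK exposes it as the `categories_filter` keyword argument of `recognize_pii_entities`, which accepts category names as strings. The sketch below wraps that call; `recognize_filtered` and `ID_CATEGORIES` are hypothetical names introduced here, and the mapping from `piiCategories` to `categories_filter` is an assumption to check against your installed SDK version. The filter restricts which categories are returned, so whether the Aadhaar number is then reported under its own category for a given language still needs to be verified against the service.

```python
def recognize_filtered(client, documents, language, categories):
    """Call recognize_pii_entities, restricted to the given PII categories.

    `client` is expected to be an azure.ai.textanalytics.TextAnalyticsClient.
    This helper is hypothetical and not part of the SDK.
    """
    response = client.recognize_pii_entities(
        documents,
        language=language,
        # Assumption: categories_filter is the SDK counterpart of piiCategories
        categories_filter=list(categories),
    )
    # Drop documents the service could not process
    return [doc for doc in response if not doc.is_error]


# Category names taken from the outputs shown in this answer
ID_CATEGORIES = ["INUniqueIdentificationNumber", "BRCPFNumber"]
```

It would be called as `recognize_filtered(client, documents, "en", ID_CATEGORIES)`, with `client` created by `authenticate_client()`.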

    Update:

    To get both categories (INUniqueIdentificationNumber and BRCPFNumber), call the service once with language="pt" and once with language="en", then merge the two results:

    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential
    
    # Authenticate the client using your key and endpoint
    def authenticate_client():
        language_endpoint = "https://xxx.cognitiveservices.azure.com/"
        language_key = "727xxxx9873476"
        ta_credential = AzureKeyCredential(language_key)
        text_analytics_client = TextAnalyticsClient(endpoint=language_endpoint, credential=ta_credential)
        return text_analytics_client
    
    client = authenticate_client()
    
    def pii_recognition_example(client):
        documents = [
            "My Aadhaar number is 4x39 xxxx9 9xx6 and my CPF is 123.456.789-09"
        ]
        
        response_pt = client.recognize_pii_entities(documents, language="pt")
        response_en = client.recognize_pii_entities(documents, language="en")
        
        # Pair the two results per document before filtering, so that an error
        # in only one response cannot misalign the pairs
        for doc_pt, doc_en in zip(response_pt, response_en):
            if doc_pt.is_error or doc_en.is_error:
                continue
            # Prioritize redaction from both responses by merging
            combined_redacted_text = doc_pt.redacted_text  # Start with PT redacted text
            # Since we want both to be redacted, check and apply additional redactions from EN result
            for entity in doc_en.entities:
                if entity.text in combined_redacted_text:
                    combined_redacted_text = combined_redacted_text.replace(entity.text, '*' * len(entity.text))
    
            print("Redacted Text (Combined): {}".format(combined_redacted_text))
            
            # Collect and display categories from both responses
            categories = set()
            for entity in doc_pt.entities:
                categories.add(entity.category)
            for entity in doc_en.entities:
                categories.add(entity.category)
            
            for category in categories:
                print("\tCategory: {}".format(category))
    
    pii_recognition_example(client)
    

    Output:

      Redacted Text (Combined): My Aadhaar number is ************** and my CPF is **************
            Category: BRCPFNumber
            Category: INUniqueIdentificationNumber
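The merge above masks entities with `str.replace`, which replaces every occurrence of the entity text and depends on the text still being present after the first pass. A possible alternative is to mask by the `offset` and `length` that each entity carries. The following is a minimal sketch, assuming the offsets index into the original string the same way Python string indices do (Unicode code points, which I believe is the Python SDK default). `redact_spans`, `merge_entity_spans`, and the `Entity` stand-in are hypothetical names, and the sample offsets are copied from the first output in this answer rather than from a new service call.

```python
from collections import namedtuple


def merge_entity_spans(*entity_lists):
    """Collect the distinct (offset, length) pairs from several entity lists."""
    return sorted({(e.offset, e.length) for entities in entity_lists for e in entities})


def redact_spans(text, spans, mask="*"):
    """Mask every (offset, length) span in text; overlapping spans are fine."""
    chars = list(text)
    for offset, length in spans:
        end = min(offset + length, len(chars))
        for i in range(max(offset, 0), end):
            chars[i] = mask
    return "".join(chars)


# Stand-in for the SDK entity objects; only offset and length are used here
Entity = namedtuple("Entity", "offset length")

text = "My Aadhaar number is 1234-5678-9101 and my CPF is 123.456.789-09"
pt_entities = [Entity(offset=50, length=14)]  # CPF span from the first output
en_entities = [Entity(offset=21, length=14)]  # Aadhaar span from the first output

print(redact_spans(text, merge_entity_spans(pt_entities, en_entities)))
# prints: My Aadhaar number is ************** and my CPF is **************
```

In the real code the two lists would be `doc_pt.entities` and `doc_en.entities`, and `text` would be the original document rather than `doc_pt.redacted_text`.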
    
