Tags: python, azure, azure-cognitive-services, named-entity-recognition, pii

How can I allow certain entities (e.g., names, organizations) in Azure's PII Entity Recognition method so that they are not recognized/masked?


I am using Azure's PII Entity Recognition method in Python to recognize PII entities in a list of documents.

  1. I am wondering if there is a way to pass the method a list of entities that should not be recognized as PII. These would be, e.g., names/organizations that are not sensitive in this context.

  2. I would like my PII entities to be replaced with the category, rather than masked with a masking character (e.g., "Andrew" will become "<PERSON>" rather than "******"). I have currently solved this problem by adding my own method and looping through the responses. I am wondering, however, if there is a better way.

Here is an example:

# Function to replace detected PII entities with their respective categories
def replace_with_category(document, doc_result):
    """Replace PII entities in the document with their categories."""
    redacted_text = document
    # Replace from the end of the string so that earlier offsets stay valid
    for entity in sorted(doc_result.entities, key=lambda e: e.offset, reverse=True):
        redacted_text = redacted_text[:entity.offset] + f"<{entity.category.upper()}>" + redacted_text[entity.offset + entity.length:]
    return redacted_text

# Function to redact PII entities in a list of documents
# (azure_text_analytics_client and pii_categories are assumed to be defined elsewhere)
def pii_redact_list(documents, language):
    """This function takes the list of 5 documents and replaces all the PII entities with their respective categories.

    The result is that, rather than ******, the category (e.g., <ORGANIZATION>) is listed in the string.
    The function first recognizes the PII entities, then replaces them with their categories.
    """
    responses = azure_text_analytics_client.recognize_pii_entities(documents, categories_filter=pii_categories, language=language)
    redacted_texts = []
    # Results come back in the same order as the input documents
    for doc_text, doc_result in zip(documents, responses):
        # Plain function call: there is no self outside a class
        redacted_text = replace_with_category(doc_text, doc_result)
        redacted_texts.append(redacted_text)
    return redacted_texts

Solution

  • Point 1: You can simply check each entity at the moment you replace it, using the result:

    for entity in sorted(doc_result.entities, key=lambda e: e.offset, reverse=True):
        redacted_text = redacted_text[:entity.offset] + f"<{entity.category.upper()}>" + redacted_text[entity.offset + entity.length:]

    Inside this loop, check whether entity.text matches one of the values you would like to keep and, if it does, skip the replacement.

  • Point 2: This looks like a correct way to do it.
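
For Point 1, a minimal sketch of the allow-list check could look like the following. It assumes each entity exposes text, category, offset and length, as the entities in the question's code do. The FakeEntity class, the allowed_texts parameter and the sample sentence are hypothetical, introduced only so the example runs without calling Azure. Matching is case-insensitive here, which is a design choice rather than anything the service requires.

```python
from dataclasses import dataclass


@dataclass
class FakeEntity:
    """Hypothetical stand-in for an SDK entity (text, category, offset, length)."""
    text: str
    category: str
    offset: int
    length: int


def replace_with_category(document, entities, allowed_texts=()):
    """Replace each entity with <CATEGORY>, skipping allow-listed entity texts."""
    # Assumption: an exact, case-insensitive match on entity.text is good enough.
    allowed = {text.casefold() for text in allowed_texts}
    redacted_text = document
    # Work from the end of the string so that earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e.offset, reverse=True):
        if entity.text.casefold() in allowed:
            continue  # leave this entity untouched
        start, end = entity.offset, entity.offset + entity.length
        redacted_text = redacted_text[:start] + f"<{entity.category.upper()}>" + redacted_text[end:]
    return redacted_text


text = "Andrew works at Contoso with Maria."
entities = [
    FakeEntity("Andrew", "Person", 0, 6),
    FakeEntity("Contoso", "Organization", 16, 7),
    FakeEntity("Maria", "Person", 29, 5),
]
print(replace_with_category(text, entities, allowed_texts=["Contoso"]))
```

With a real response you would pass doc_result.entities as the second argument. Note that an exact text match will not catch variants such as "Contoso Ltd.", so the allow-list may need normalization depending on your data.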