I am using Azure's PII Entity Recognition method in Python to recognize PII entities in a list of documents.
Is there a way to pass the method a list of entities that should not be recognized as PII? These would be, e.g., names or organizations that are not sensitive in this context.
I would like my PII entities to be replaced with the category, rather than masked with a masking character (e.g., "Andrew" will become "<PERSON>" rather than "******"). I have currently solved this problem by adding my own method and looping through the responses. I am wondering, however, if there is a better way.
Here is an example:
# Function to replace detected PII entities with their respective categories
def replace_with_category(document, doc_result):
    """Replace PII entities in the document with their categories."""
    redacted_text = document
    # Work from the last entity to the first so earlier offsets stay valid
    for entity in sorted(doc_result.entities, key=lambda e: e.offset, reverse=True):
        redacted_text = redacted_text[:entity.offset] + f"<{entity.category.upper()}>" + redacted_text[entity.offset + entity.length:]
    return redacted_text
# Function to redact PII entities in a list of documents
def pii_redact_list(documents, language):
    """Take the list of 5 documents and replace all the PII entities with their respective categories.

    Rather than ******, the category (e.g., <ORGANISATION>) is listed in the string.
    The language is passed in by the caller. The function recognizes the PII
    entities and then replaces them with their categories.
    """
    # azure_text_analytics_client and pii_categories are defined elsewhere
    responses = azure_text_analytics_client.recognize_pii_entities(documents, categories_filter=pii_categories, language=language)
    redacted_texts = []
    # Results come back in the same order as the input documents
    for doc_text, doc_result in zip(documents, responses):
        # Plain function call: there is no self outside a class
        redacted_text = replace_with_category(doc_text, doc_result)
        redacted_texts.append(redacted_text)
    return redacted_texts
Point 1: You can do this check at the point where you replace the entities using the result:
for entity in sorted(doc_result.entities, key=lambda e: e.offset, reverse=True):
    redacted_text = redacted_text[:entity.offset] + f"<{entity.category.upper()}>" + redacted_text[entity.offset + entity.length:]
Inside that loop, check whether entity.text matches one of the values you would like to keep, and skip the replacement if it does.
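The check from Point 1 could look roughly like the sketch below. The `allow_list` parameter and the `Entity` dataclass are assumptions made for illustration: `Entity` only stands in for the `text`, `category`, `offset` and `length` attributes used in the question, so the example runs without calling the service. The case-insensitive comparison is also a choice of this sketch, not something the question asks for.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Entity:
    """Hypothetical stand-in for the entity attributes used in the question."""
    text: str
    category: str
    offset: int
    length: int


def replace_with_category(document: str, entities: Sequence[Entity],
                          allow_list: Iterable[str] = ()) -> str:
    """Replace entities with <CATEGORY>, skipping any whose text is allow-listed."""
    # Normalise once so the membership test ignores case (an assumption)
    allowed = {term.casefold() for term in allow_list}
    redacted_text = document
    # Go from last to first so earlier offsets are not shifted by replacements
    for entity in sorted(entities, key=lambda e: e.offset, reverse=True):
        if entity.text.casefold() in allowed:
            continue  # not sensitive in this context, keep the original text
        start, end = entity.offset, entity.offset + entity.length
        redacted_text = redacted_text[:start] + f"<{entity.category.upper()}>" + redacted_text[end:]
    return redacted_text


doc = "Andrew works at Contoso."
entities = [
    Entity("Andrew", "Person", 0, 6),
    Entity("Contoso", "Organization", 16, 7),
]
result = replace_with_category(doc, entities, allow_list=["Contoso"])
print(result)
```

With your real results you would pass `doc_result.entities` instead of the hand-built list. Because skipped entities leave the text untouched, the offsets of the remaining entities are still valid.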
Point 2: Your approach of looping through the responses and replacing each entity with its category looks correct.