python-3.x, nlp, lstm

Replace specific text with a redacted version using Python


I am looking to do the opposite of what has been done here:

import re

text = '1234-5678-9101-1213 1415-1617-1819-hello'

re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)

output = 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'

Partial replacement with re.sub()

My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.

The end result would look like:

XXXX went to XXXX XXXXXX

Sponge Bob went to Disney World.

In short, I am unmasking text and replacing it with a generated dataset using fuzzy.


Solution

  • You can do it using named-entity recognition (NER). It's fairly simple, and there are off-the-shelf tools for it, such as spaCy.

    NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.

    Example:

    Sponge Bob went to South beach, he payed a ticket of $200!
    I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.

    Returns:

    [Image: NER with spaCy output for the two sentences above]

    Just be aware that this is not 100% accurate!

    Here is a little snippet for you to try out:

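    import spacy

    phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!', 'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']
    # spaCy 3 dropped the 'en' shortcut; install the model first with:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')
    for phrase in phrases:
        doc = nlp(phrase)
        replaced = ""
        for token in doc:
            # ent_type_ is a non-empty string when the token belongs to a named entity
            if token.ent_type_:
                replaced += "XXXX "
            else:
                replaced += token.text + " "
        print(replaced)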
    
    

    Read more here: https://spacy.io/usage/linguistic-features#named-entities

    You could, instead of replacing with XXXX, replace based on the entity type, like:

    if ent.label_ == "PERSON":
       replaced += "<PERSON> "
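The fragment above implies a loop over entities rather than over tokens. Here is a minimal sketch of that loop, assuming each entity arrives as a `(start_char, end_char, label)` tuple, which is the same information spaCy exposes as `ent.start_char`, `ent.end_char` and `ent.label_`. The helper name `mask_entities` and the hand-written spans are made up for illustration, so the example runs without a model:

```python
def mask_entities(text, spans):
    """Replace each (start, end, label) character span with a <LABEL> tag."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # text before the entity, unchanged
        pieces.append(f"<{label}>")        # placeholder named after the entity type
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last entity
    return "".join(pieces)


text = "Sponge Bob went to South beach"
# Hypothetical spans, written by hand here; with spaCy they would come from
# [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]
spans = [(0, 10, "PERSON"), (19, 30, "LOC")]
masked = mask_entities(text, spans)
print(masked)  # <PERSON> went to <LOC>
```

Working on character offsets instead of rebuilding the sentence token by token keeps the original spacing and punctuation intact.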
    

    Then:

    import re, random
    
    personames = ["Jack", "Mike", "Bob", "Dylan"]
    
    # re.sub replaces every <PERSON> tag; random.choice runs once, so all tags get the same name
    phrase = re.sub("<PERSON>", random.choice(personames), phrase)
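If a phrase contains several `<PERSON>` tags and each should get its own name, one option is to pass a function as the replacement argument of `re.sub`, which is then called once per match. This is only a sketch: the helper name `fill_persons`, the sample phrase and the seeded generator are assumptions added for illustration.

```python
import random
import re

person_names = ["Jack", "Mike", "Bob", "Dylan"]
rng = random.Random(0)  # seeded only so repeated runs give the same picks


def fill_persons(phrase, names, rng):
    """Replace each <PERSON> tag with an independently chosen name."""
    # re.sub calls the function once per match, so every tag gets its own draw
    return re.sub(r"<PERSON>", lambda match: rng.choice(names), phrase)


filled = fill_persons("<PERSON> told <PERSON> about <LOC>", person_names, rng)
print(filled)
```

Tags for other entity types, such as `<LOC>` here, are left untouched and can be filled the same way from their own lists.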