Replace specific text with a redacted version using Python

I am looking to do the opposite of what has been done here:

import re

text = '1234-5678-9101-1213 1415-1617-1819-hello'

re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)

output = 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'

Partial replacement with re.sub()

My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.

The end result would look like:

XXXX went to XXXX XXXXXX

Sponge Bob went to Disney World.

In short, I am unmasking text and replacing it with a generated dataset using fuzzy.

Solution

You can do this with named-entity recognition (NER). It is fairly simple, and there are off-the-shelf tools for it, such as spaCy.

NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.

Example:

Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.

Returns:

Just be aware that this is not 100% accurate!

Here is a little snippet for you to try out:

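import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!', 'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']
# Requires the small English model: python -m spacy download en_core_web_sm
# (the old 'en' shortcut no longer works in spaCy 3)
nlp = spacy.load('en_core_web_sm')
for phrase in phrases:
    doc = nlp(phrase)
    replaced = ""
    for token in doc:
        # doc.ents holds Span objects, so test the token's own entity type instead
        if token.ent_type_:
            replaced += "XXXX" + token.whitespace_
        else:
            # text_with_ws keeps the original spacing after the token
            replaced += token.text_with_ws
    print(replaced)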

You could, instead of replacing with XXXX, replace based on the entity type, like:

# Inside the token loop: ent_type_ is the entity label of the current token
if token.ent_type_ == "PERSON":
    replaced += "<PERSON> "
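A token-by-token loop emits one placeholder per token, so a two-word name such as "Sponge Bob" ends up as two tags. Below is a minimal sketch of tagging whole entities instead, assuming the entities are available as (start, end, label) character spans. With spaCy these could be built from doc.ents using ent.start_char, ent.end_char and ent.label_. The function name mask_entities and the hard-coded spans are hypothetical and only for illustration.

```python
def mask_entities(text, spans):
    """Replace each (start, end, label) character span with a <LABEL> tag."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # text before the entity, unchanged
        pieces.append(f"<{label}>")        # one tag per entity, not per token
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last entity
    return "".join(pieces)


text = "Sponge Bob went to South beach."
# Hypothetical spans, hard-coded here; with spaCy they could come from
# [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]
spans = [(0, 10, "PERSON"), (19, 30, "LOC")]
masked = mask_entities(text, spans)
print(masked)
```

The sketch assumes the spans do not overlap, which holds for spaCy's doc.ents.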

Then:

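import re
import random

personames = ["Jack", "Mike", "Bob", "Dylan"]

# The function is re.sub (re.replace does not exist).
# Passing a function picks a new random name for every <PERSON> tag.
phrase = re.sub("<PERSON>", lambda match: random.choice(personames), phrase)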
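To tie this back to the .csv file mentioned in the question, here is a rough sketch that loads replacement values per entity type and fills every tag with an independent random choice. The CSV layout (columns label and value), the inline sample data, and the function names load_values and fill_placeholders are all assumptions for illustration. A real file would be opened with open(...) instead of io.StringIO.

```python
import csv
import io
import random
import re
from collections import defaultdict

# Hypothetical CSV content; assumed columns: label,value
SAMPLE_CSV = """label,value
PERSON,Jack
PERSON,Mike
GPE,Disney World
GPE,St. Louis
"""


def load_values(csv_file):
    """Group replacement values by entity label."""
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        values[row["label"]].append(row["value"])
    return values


def fill_placeholders(text, values, rng=random):
    """Replace each <LABEL> tag with a random value for that label."""
    def pick(match):
        candidates = values.get(match.group(1))
        # Leave unknown tags untouched rather than guessing a value
        return rng.choice(candidates) if candidates else match.group(0)
    return re.sub(r"<([A-Z_]+)>", pick, text)


values = load_values(io.StringIO(SAMPLE_CSV))
rng = random.Random(0)  # seeded only to make the run repeatable
result = fill_placeholders("<PERSON> went to <GPE> with <PERSON>, on <DATE>.", values, rng)
print(result)
```

Using a function as the replacement also means the chosen value is inserted literally, so a backslash in the CSV data is not interpreted as a regex escape.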