I have a script that anonymizes personal data: when a string contains words that start with a capital letter, another function replaces them (this is how names are anonymized).
I want to write a function in which the regex looks for words given in a list. When a string contains one of the words in that list, the word should not be replaced. For example: Mijn naam is kim en ik heb een opleiding gevolgd aan de Universiteit van Amsterdam
Because Universiteit van Amsterdam is written with capital letters, it will be anonymized by the other function. I want to add an extra function that uses a regex so that the words in a given list are left alone whenever a string contains them.
I have a function that replaces the matches, but I want the matched words to be ignored instead.
This is the function that anonymizes names:
import re

# Note: tussenVList, strz and pureNLP2 are defined elsewhere in the script.
def anonymizeNames(sentence):
    '''
    :param sentence: the input sentence
    :return: the sentence without names
    '''
    # define x
    x = ""
    # Check naam: indication
    names0Reg = "[Aa]chternaam:|[Vv]oornaam:|[Nn]aam:|[Nn]amen:"
    res = re.search(names0Reg, sentence)
    if res != None:
        # Achternaam:, voornaam: or naam: or namen: occurs; next standardize
        sentence = re.sub('[Nn]amen:', 'naam:', sentence)
        sentence = re.sub('[Aa]chternaam:', 'naam:', sentence)
        sentence = re.sub('[Vv]oornaam:', 'naam:', sentence)
        sentence = re.sub('Naam:', 'naam:', sentence)
        # Extract names
        names00Reg = "naam: [A-Za-z]+"
        x = re.findall(names00Reg, sentence)
        for y in x:
            # remove naam:\s
            y = re.sub('naam: ', '', y)
            # Check for tussenvoegsels
            if y in tussenVList:
                # Add next word
                regTest = y + " " + "[A-Za-z]+"
                x2 = re.search(regTest, sentence)
                if x2 != None:
                    # Name found
                    y = x2.group()
            # replace
            sentence = re.sub(y, strz, sentence)
    # Always check sentences for names 1
    names1Reg = "[Ii]k [Bb]en ([A-Z]{1}[a-z ]{2,})+[\\.\\,]*"
    res = re.search(names1Reg, sentence)
    if res != None:
        # adjust result
        x = re.sub('[Ii]k [Bb]en ', '', res.group())
        x = re.sub('[\\,\\.]', '', x)
        # use NLP to only keep names
    # Always check sentences for names 2
    names2Reg = "[Mm]ijn [Nn]aam is ([A-Z]{1}[a-z\\s-]{2,})+[\\.\\,]*"
    res = re.search(names2Reg, sentence)
    if res != None:
        # adjust result
        x = re.sub('[Mm]ijn [Nn]aam is ', '', res.group())
        x = re.sub('[\\,\\.]', '', x)
        # use NLP to only keep names
    # Check for single letter followed by dot and series of letters
    if x == "":
        regNameLet = "^[A-Z]{1}\\.[A-Za-z]{2,}|\\s[A-Z]{1}\\.[A-Za-z]{2,}"
        res = re.search(regNameLet, sentence)
        if res != None:
            # replace word in sentence, first at start
            sentence = re.sub('^[A-Z]{1}\\.[A-Za-z]{2,}', strz, sentence)
            # next in sentence with additional space
            strY = " " + strz
            sentence = re.sub('\\s[A-Z]{1}\\.[A-Za-z]{2,}', strz, sentence)
    # Check for occurrence of two subsequent uppercase words (might be a name)
    if x == "":
        res = re.findall("[A-Z]{1}[a-z]{2,}\\s[A-Z]{1}[a-z]{2,}", sentence)
        if res != []:
            for y in res:
                if len(y) > 1:
                    # replace name with strz
                    sentence = re.sub(y, strz, sentence)
    # Always recheck remaining sentence with NLP to make sure all personal info is removed
    sentence = pureNLP2(sentence)  # pureNLP2 tries to include entity checks
    return (sentence)
This is my function for finding the names of universities; these are the names that I do not want to be replaced:
schools = ['Hogenschool Amsterdam', 'Universiteit van Amsterdam']
strX = 'xxx'

def school(sentence):
    # The list is named schools so that it does not clash with the function name.
    for schoolname in schools:
        res = re.findall(schoolname, sentence)
        if res != []:
            for y in res:
                if len(y) > 1:
                    sentence = replaceNice(sentence, strX, y)
    return sentence

print(school('Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam'))
Output: Mijn naam xxx en ik volg een opleiding aan de xxx xxx
The output that I want is:
Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam
I think I have a start, but I am stuck on how to finish handling the variable sentence: if the string contains words that match the list of school names, I do not want to replace them, just return them unchanged.
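Replace all the safe words with a lowercase version, then anonymize, then restore the lowercased safe words to their original form. This works because the anonymization step only touches words that start with a capital letter, so the lowercased safe words pass through it untouched.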
test_strings = ['Adam goes to Universiteit van Amsterdam', 'George goes to Washington College', 'Anthony Hopkins is a student at Johns Hopkins']
safe_words = ['Universiteit van Amsterdam', 'Johns Hopkins', 'Washington College']

def anonymize(sentence, safe_words):
    restore = {}
    # Hide the safe words by lowercasing them, remembering the original form
    for word in safe_words:
        sentence = sentence.replace(word, word.lower())
        restore[word.lower()] = word
    # Anonymize every remaining word that starts with a capital letter
    for word in sentence.split():
        if word[0].isupper():
            sentence = sentence.replace(word, word[0]+'.')
    # Put the safe words back in their original form
    for word, restored_word in restore.items():
        sentence = sentence.replace(word, restored_word)
    return sentence

for sentence in test_strings:
    print(anonymize(sentence, safe_words))
Output:
A. goes to Universiteit van Amsterdam
G. goes to Washington College
A. H. is a student at Johns Hopkins
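Since the question asks for a regex, here is a rough sketch of the same idea done in a single pass with re.sub and a replacement function. The safe phrases are tried first in the pattern, so they are matched as a whole and returned unchanged, while any other capitalized word is replaced. The function name anonymize_except, the replacement parameter and the simple [A-Z][a-z]+ name rule are assumptions made for illustration; the name rule is only a stand-in for the much more elaborate anonymizeNames from the question, and it will also replace ordinary capitalized words such as the first word of a sentence.

```python
import re

def anonymize_except(sentence, safe_words, replacement="xxx"):
    """Replace capitalized words with `replacement`, leaving safe phrases untouched.

    Hypothetical helper: the capitalized-word rule below is a simplification
    and not the anonymizer from the question.
    """
    # Try longer phrases first so a long safe phrase wins over a shorter one.
    ordered = sorted(safe_words, key=len, reverse=True)
    # re.escape keeps characters such as "." in a phrase from acting as regex syntax.
    # With no safe words, fall back to (?!), a lookahead that never matches.
    safe_part = "|".join(re.escape(w) for w in ordered) or r"(?!)"
    # Group 1 captures a safe phrase; the second alternative is any capitalized word.
    pattern = re.compile(r"\b(" + safe_part + r")\b|\b[A-Z][a-z]+\b")

    def substitute(match):
        # Group 1 is set only when a safe phrase matched: return it unchanged.
        if match.group(1) is not None:
            return match.group(1)
        return replacement

    return pattern.sub(substitute, sentence)

safe_words = ['Universiteit van Amsterdam', 'Johns Hopkins', 'Washington College']
print(anonymize_except('Anthony Hopkins is a student at Johns Hopkins', safe_words))
# prints: xxx xxx is a student at Johns Hopkins
```

Compared with the lowercase-and-restore approach, this does not alter a safe phrase that already appears in lowercase in the input, at the cost of folding the "what to anonymize" rule into one pattern.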