
Function that leaves certain words unreplaced using regex


I have a script in which I anonymize personal data: when a string contains words that start with a capital letter, another function replaces them (that is how names get anonymized).

I want to write a function in which a regex looks for the words given in a list. When a string contains one of the words in that list, the word should not be replaced. To give an example: Mijn naam is kim en ik heb een opleiding gevolgd aan de Universiteit van Amsterdam

Because "Universiteit van Amsterdam" is written with capital letters, it will be anonymized by another function. I want to add an extra function that uses regex so that the words in a given list are ignored when a string contains them.
I have a function that replaces them, but I want the matched words to be ignored instead.

This is the anonymizeNames function:

import re

# Note: strz, tussenVList and pureNLP2 are used below but are not shown in this post
def anonymizeNames(sentence):
    '''
        :param sentence: the input sentence
        :return: the sentence without names
    '''

    ##define x
    x = ""

    ##Check naam: indication
    names0Reg = "[Aa]chternaam:|[Vv]oornaam:|[Nn]aam:|[Nn]amen:"
    res = re.search(names0Reg, sentence)
    if res != None:
        ##Achternaam:, voornaam: or naam: or namen: occurs; next Standardize
        sentence = re.sub('[Nn]amen:', 'naam:', sentence)
        sentence = re.sub('[Aa]chternaam:', 'naam:', sentence)
        sentence = re.sub('[Vv]oornaam:', 'naam:', sentence)
        sentence = re.sub('Naam:', 'naam:', sentence)

        ##Extract names
        names00Reg = "naam: [A-Za-z]+"
        x = re.findall(names00Reg, sentence)
        for y in x:
            ##remove naam:\s
            y = re.sub('naam: ', '', y)
            ##Check for tussenvoegsels
            if y in tussenVList:
                ##Add next word
                regTest = y + " " + "[A-Za-z]+"
                x2 = re.search(regTest, sentence)
                if x2 != None:
                    ##Name found
                    y = x2.group()
                    ##replace
                    sentence = re.sub(y, strz, sentence)

    ##Always check sentences for names 1
    names1Reg = "[Ii]k [Bb]en ([A-Z]{1}[a-z ]{2,})+[\\.\\,]*"
    res = re.search(names1Reg, sentence)
    if res != None:
        ##adjust result
        x = re.sub('[Ii]k [Bb]en ', '', res.group())
        x = re.sub('[\\,\\.]', '', x)
        ##use NLP to only keep names
        

    ##Always check sentences for names 2
    names2Reg = r"[Mm]ijn [Nn]aam is ([A-Z]{1}[a-z\s-]{2,})+[\.\,]*"
    res = re.search(names2Reg, sentence)
    if res != None:
        ##adjust result
        x = re.sub('[Mm]ijn [Nn]aam is ', '', res.group())
        x = re.sub('[\\,\\.]', '', x)
        ##use NLP to only keep names
        

    ##Check for single letter followed by dot and series of letters
    if x == "":
        regNameLet = r"^[A-Z]{1}\.[A-Za-z]{2,}|\s[A-Z]{1}\.[A-Za-z]{2,}"
        res = re.search(regNameLet, sentence)
        if res != None:
            ##replace word in sentence, first at start
            sentence = re.sub(r'^[A-Z]{1}\.[A-Za-z]{2,}', strz, sentence)
            ##next in sentence with additional space
            strY = " " + strz
            sentence = re.sub(r'\s[A-Z]{1}\.[A-Za-z]{2,}', strY, sentence)

    ##Check for occurence of two subsequent uppercase words (might be a name)
    if x == "":
        res = re.findall(r"[A-Z]{1}[a-z]{2,}\s[A-Z]{1}[a-z]{2,}", sentence)
        if res != []:
            for y in res:
                if len(y) > 1:
                    ##replace name with strX
                    sentence = re.sub(y, strz, sentence)

    ##Always recheck remaining sentence with NLP to make sure all personal info is removed
    sentence = pureNLP2(sentence)  ##pureNLP2 tries to include entity checks

    return (sentence)

This is my function for finding the names of universities. With this function I do not want them to be replaced:

schools = ['Hogenschool Amsterdam', 'Universiteit van Amsterdam']
strX = 'xxx'

def school(sentence):
    for schoolname in schools:
        res = re.findall(schoolname, sentence)
        if res != []:
            for y in res:
                if len(y) > 1:
                    ##replaceNice is a helper that is not shown in this post
                    sentence = replaceNice(sentence, strX, y)
    return sentence

print(school('Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam'))

Output: Mijn naam xxx en ik volg een opleiding aan de xxx xxx

The output that I want is: Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam

I think I have a start, but I am stuck on what to do with the variable sentence. If the string contains matching words from the list of schools, I want to leave them as they are and return the string unchanged instead of replacing them.


Solution

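  • Replace all the safe words with a lowercase version, then anonymize, then restore the lowercased safe words to their original form. Because the anonymizer only targets words that start with a capital letter, the lowercased safe words pass through it untouched.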

    test_strings = ['Adam goes to Universiteit van Amsterdam', 'George goes to Washington College', 'Anthony Hopkins is a student at Johns Hopkins']
    safe_words = ['Universiteit van Amsterdam', 'Johns Hopkins', 'Washington College']
    
    def anonymize(sentence, safe_words):
        restore = {}
    
        for word in safe_words:
            sentence = sentence.replace(word, word.lower())
    
            restore[word.lower()] = word
        
        for word in sentence.split():
            if word[0].isupper():
                sentence = sentence.replace(word, word[0]+'.')
        
        for word, restored_word in restore.items():
            sentence = sentence.replace(word, restored_word)
        
        return sentence
    
    for sentence in test_strings:
        print(anonymize(sentence, safe_words))
    

    Output:

    A. goes to Universiteit van Amsterdam
    G. goes to Washington College
    A. H. is a student at Johns Hopkins
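
    Since the question asks for a regex, here is a minimal sketch of a single-pass alternative, not a drop-in replacement for anonymizeNames. It assumes the same simplified rule as the code above (any capitalized word is reduced to its initial); the names anonymize_regex, NAME and keep_or_mask are made up for this example. The safe phrases are put first in an alternation, so the regex engine tries them before the generic capitalized-word pattern, and the callback returns them unchanged.

```python
import re

# Hypothetical rule for this sketch: any word starting with a capital letter
NAME = r"\b[A-Z]\w*"

def anonymize_regex(sentence, safe_words):
    # Escape the safe phrases and try longer ones first, so a longer
    # phrase is preferred over a shorter one that shares its start.
    alternatives = [re.escape(w) for w in sorted(safe_words, key=len, reverse=True)]
    if alternatives:
        # Group 1 captures a safe phrase; the second branch has no group.
        pattern = r"\b(" + "|".join(alternatives) + r")\b|" + NAME
    else:
        pattern = NAME

    def keep_or_mask(match):
        # lastindex is set only when group 1 (a safe phrase) took part in the match
        if match.lastindex:
            return match.group(0)
        # Otherwise mask the capitalized word, keeping only its initial
        return match.group(0)[0] + "."

    return re.sub(pattern, keep_or_mask, sentence)

safe_words = ['Universiteit van Amsterdam', 'Johns Hopkins', 'Washington College']
print(anonymize_regex('Anthony Hopkins is a student at Johns Hopkins', safe_words))
# A. H. is a student at Johns Hopkins
```

    Because the match is anchored on word boundaries and nothing is lowercased, this version does not touch text that merely contains a name as a substring, and it does not depend on the lowercase form of a safe phrase being absent from the input.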