Python RegEx to remove social security number from string generated with speech-to-text


I'm trying to remove social security numbers (SSNs) for GDPR compliance reasons from messy data generated with speech-to-text. Here is a sample string (translated to English, which explains why 'and' occurs where the SSN is listed):

sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"

My goal is to remove the part "thirteen ... forty " while keeping any other numbers that appear in the string, resulting in:

sample1_wo_ssn = "hello my name is sofie my social security number is and I live on mountain street number twelve"

The length of the social security number can vary as a consequence of how data is generated (3-10 separated numbers).

My approach:

  1. Replace written numbers with digits using a dict.
  2. Use a regex to find runs of 3 or more numbers separated only by whitespace or "and", and remove the whole run, including any numbers that follow the first 3.

Here is my code:

import re

number_dict = {
    'zero': '0',
    'one': '1',
    'two': '2',
    'three': '3',
    'four': '4',
    'five': '5',
    'six': '6',
    'seven': '7',
    'eight': '8',
    'nine': '9',
    'ten': '10',
    'eleven': '11',
    'twelve': '12',
    'thirteen': '13',
    'fourteen': '14',
    'fifteen': '15',
    'sixteen': '16',
    'seventeen': '17',
    'eighteen': '18',
    'nineteen': '19',
    'twenty': '20',
    'thirty': '30',
    'forty': '40',
    'fifty': '50',
    'sixty': '60',
    'seventy': '70',
    'eighty': '80',
    'ninety': '90'
}


sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
sample1_temp = [number_dict.get(item,item)  for item in sample1.split()]
sample1_numb = ' '.join(sample1_temp)
re_results = re.findall(r'(\d+ (and\s)?\d+ (and\s)?\d+\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?)', sample1_numb) 

print(re_results)

Output:

[('13 0 4 5 and 70 18 7 and 40 and ', '', '', '', '5', 'and ', '70', '', '18', '', '7', 'and ', '40', 'and ', '', '', '', '', '')]

This is where I'm stuck.

In this example I could do something like sample1_wo_ssn = re.sub(re_results[0][0],'',sample1_numb) to get the desired result, but this will not generalize.

Any help would be greatly appreciated.


Solution

  • Here is an implementation of your current logic, namely:

    • Convert number words (zero through nineteen, and the tens from twenty through ninety) into digits
    • Remove every run of 3 or more numbers separated by whitespace, optionally with "and" in between
    • Convert the remaining one- or two-digit numbers back to words.

    Credits: the helper code is adapted from the Stack Overflow answers linked in the code comments.

    See Python code:

    import re

    # Number words for 0-19 and for the multiples of ten.
    number_words = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    number_words_tens = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    # Matches the number words listed above.
    number_words_rx = re.compile(r'\b(?:(?:{0})?(?:{1})|(?:{0}))\b'.format("|".join(number_words_tens), "|".join(number_words)))
    # A run of 3 or more numbers separated by whitespace, optionally with "and" in between.
    main_rx = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')
    # Lookup table from integer to words for 0-99 (index 21 -> "twenty-one").
    numbers_1_99 = list(number_words)  # copy, so number_words itself is not extended
    numbers_1_99.extend(tens if ones == "zero" else (tens + "-" + ones) # stackoverflow.com/a/8982279/3832970
        for tens in number_words_tens
        for ones in number_words[0:10])

    def text2int(textnum): # adapted from stackoverflow.com/a/493788/3832970
        """Return the sum of the values of the number words in textnum."""
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        # Built on every call instead of using a mutable default argument.
        numwords = {"and": 0}
        for idx, word in enumerate(units):
            numwords[word] = idx
        for idx, word in enumerate(tens):
            if word:  # skip the two placeholder entries
                numwords[word] = idx * 10
        current = 0
        for word in textnum.split():
            if word not in numwords:
                raise ValueError("Illegal word: " + word)
            current += numwords[word]
        return current

    sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
    # Step 1: replace each number word with digits.
    sample1 = number_words_rx.sub(lambda x: str(text2int(x.group())), sample1)
    # Step 2: remove runs of 3 or more numbers.
    re_results = main_rx.sub('', sample1)
    # Step 3: turn the remaining standalone one- or two-digit numbers back into words.
    print(re.sub(r'\b\d{1,2}\b', lambda x: numbers_1_99[int(x.group())], re_results))
    

    Output: hello my name is sofie my social security number is and I live on mountain street number twelve
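
If this has to run over many transcripts, the three steps above can be wrapped in one function. The following is only a minimal sketch under some assumptions: the names `remove_ssn`, `WORD_TO_DIGITS`, `DIGITS_TO_WORD`, `WORD_RX`, `RUN_RX` and `DIGITS_RX` are hypothetical and not part of the code above, and a plain dict lookup stands in for `text2int`, since every matched word maps to exactly one number.

```python
import re

# Hypothetical helper names; the word lists and the run pattern follow the answer above.
UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# word -> digit string, e.g. "forty" -> "40"
WORD_TO_DIGITS = {word: str(i) for i, word in enumerate(UNITS)}
WORD_TO_DIGITS.update({word: str((i + 2) * 10) for i, word in enumerate(TENS)})
# digit string -> word, used to restore the numbers that were not removed
DIGITS_TO_WORD = {digits: word for word, digits in WORD_TO_DIGITS.items()}

WORD_RX = re.compile(r'\b(?:{})\b'.format("|".join(WORD_TO_DIGITS)))
# Three or more numbers separated by whitespace, optionally with "and" in between.
RUN_RX = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')
DIGITS_RX = re.compile(r'\b\d{1,2}\b')


def remove_ssn(text):
    """Remove runs of three or more spoken numbers; restore shorter runs as words."""
    as_digits = WORD_RX.sub(lambda m: WORD_TO_DIGITS[m.group()], text)
    without_runs = RUN_RX.sub('', as_digits)
    # Digits with no single-word form (e.g. "25") are left untouched.
    return DIGITS_RX.sub(lambda m: DIGITS_TO_WORD.get(m.group(), m.group()), without_runs)


samples = [
    "my number is thirteen zero four five and seventy and I live at number twelve",
    "we have two cats and three dogs",
]
for sample in samples:
    print(remove_ssn(sample))
# my number is and I live at number twelve
# we have two cats and three dogs
```

The second sample shows that fewer than three consecutive numbers are left alone, which is what keeps the street number in the original string. Note that a run of three or more legitimate numbers (for example a spoken phone number) would be removed as well, so the threshold in `RUN_RX` may need tuning for real data.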