
Python Regex - remove all "." and special characters EXCEPT the decimal point


I have some sentences with multiple "."s.

How can I remove all special characters and '.' in the data except the decimal point?

An example input is:

What? The Census Says It’s Counted 99.9 Percent of Households. Don’t Be Fooled.

I want to remove every "." and every special character EXCEPT the decimal point inside a number.

The output should be like

What The Census Says Its Counted 99.9 Percent of Households Dont Be Fooled

I tried this:

regex = re.compile('[^ (\w+\.\w+)0-9a-zA-Z]+')
regex.sub('', test)

But the output was

What The Census Says Its Counted 99.9 Percent of Households. Dont Be Fooled.

Solution

  • Use a capturing group to capture a decimal point together with the digits on either side of it, and in the same pattern add an alternative that matches any special character (i.e. anything that is neither whitespace nor a word character).

    In the replacement, refer back to the capturing group so that only the captured characters survive: a special character is replaced with nothing, and a decimal match is replaced with itself.

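    import re

    s = 'What? The Census Says It’s Counted 99.9 Percent of Households. Don’t Be Fooled.'
    # Group 1 matches a dot between two digits; the alternative matches any other special character.
    rgx = re.compile(r'(\d\.\d)|[^\s\w]')
    # group(1) is None when a special character matched, so fall back to an empty string.
    rgx.sub(lambda x: x.group(1) or '', s)
    # 'What The Census Says Its Counted 99.9 Percent of Households Dont Be Fooled'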

    OR

    Match every dot that is not between two digits, together with every other special character, and replace all of those matches with an empty string.

    # (?<!\d)\. is a dot not preceded by a digit; \.(?!\d) is a dot not followed by a digit.
    re.sub(r'(?<!\d)\.|\.(?!\d)|[^\s\w.]', '', s)
    # 'What The Census Says Its Counted 99.9 Percent of Households Dont Be Fooled'
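
The lookaround approach can be wrapped in a small helper, which makes it easier to try on other inputs. This is only a sketch: the function name remove_specials_keep_decimal_point is made up for illustration, and it assumes a "decimal point" means a dot with a digit directly on both sides.

```python
import re

# Hypothetical helper name, not part of the original answer.
# The pattern removes:
#   (?<!\d)\.   a dot that is not preceded by a digit
#   \.(?!\d)    a dot that is not followed by a digit
#   [^\s\w.]    any other character that is not whitespace, a word character, or a dot
_SPECIALS = re.compile(r'(?<!\d)\.|\.(?!\d)|[^\s\w.]')


def remove_specials_keep_decimal_point(text):
    """Drop special characters, keeping only dots that sit between two digits."""
    return _SPECIALS.sub('', text)


sample = 'What? The Census Says It’s Counted 99.9 Percent of Households. Don’t Be Fooled.'
print(remove_specials_keep_decimal_point(sample))
print(remove_specials_keep_decimal_point('Counted in 2020. Done!'))
```

Under that assumption, a sentence-ending dot after a number (as in "2020.") is removed, and so is a leading dot in something like ".5". Underscores are kept because \w matches them, and whitespace is left as it is, so removing a standalone symbol can leave two spaces in a row.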