python regex ascii

Regex For Special Character (S with line on top)


I was trying to write a regex in Python to replace all non-ASCII characters with an underscore, but if one of the characters is "S̄" (an 'S' with a line on top), it adds an extra 'S'. Is there a way to account for this character as well? I believe it's a valid UTF-8 character, but not ASCII.

Here's the code:

import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))

I would expect it to output:

ra_ndom_word_

But instead I get:

ra_ndom_wordS__

Solution

  • The reason Python works this way is that you are actually looking at two distinct characters: an S, followed by a combining macron (U+0304).
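
    To see the two code points for yourself, a small sketch like the following can help. The helper name describe is made up for illustration, and the macron is written as an escape so the source stays unambiguous.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [
        ("U+%04X" % ord(char), unicodedata.name(char, "<unnamed>"),
         unicodedata.combining(char))
        for char in text
    ]

# Same string as in the question; "\u0304" is the combining macron.
line = "ra*ndom wordS\u0304"

# Only look at the last two code points, which render as one glyph.
for entry in describe(line[-2:]):
    print(entry)
```

    A nonzero combining class in the last column marks a character that attaches to the one before it.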

    In the general case, if you want to replace a sequence of combining characters and the base character with an underscore, try e.g.

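    import unicodedata

    def cleanup(line):
        cleaned = []
        for char in line:
            if unicodedata.combining(char):
                # A combining mark modifies the previous character, so
                # replace that base character (already emitted) with "_".
                # Doing it here also covers a mark at the end of the line.
                if cleaned:
                    cleaned[-1] = "_"
                else:
                    cleaned.append("_")
                continue
            # Keep only cased letters; everything else becomes "_".
            if unicodedata.category(char) not in ("Ll", "Lu"):
                char = "_"
            cleaned.append(char)
        return ''.join(cleaned)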
    

    By the by, \W does not need square brackets around it; it's already a regex character class.

    Python's re module does not support Unicode property escapes such as \p{L}. If you specifically want a regex for this, the third-party regex library has proper support for Unicode categories.
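
    If you want to stay in the standard library, a rough regex approximation is possible under two loud assumptions: the only combining marks you care about are in the Combining Diacritical Marks block (U+0300 to U+036F), and only ASCII letters should survive. The names MARKS, PATTERN and cleanup_re are made up for this sketch; with the regex library, the hard-coded range could be expressed with the \p{M} property instead.

```python
import re

# Assumption: only combining marks in U+0300-U+036F are handled.
MARKS = "\u0300-\u036f"

# First alternative: a non-mark character followed by one or more marks.
# Second alternative: any single character that is not an ASCII letter.
PATTERN = re.compile("[^" + MARKS + "][" + MARKS + "]+|[^A-Za-z]")

def cleanup_re(line):
    """Replace each match (base plus marks, or a lone non-letter) with "_"."""
    return PATTERN.sub("_", line)

print(cleanup_re("ra*ndom wordS\u0304"))  # ra_ndom_word_
```

    The order of the alternatives matters: the base-plus-marks alternative has to come first so that the S is consumed together with its macron rather than being kept as a plain letter.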

    "Ll" is the category for lowercase letters and "Lu" the one for uppercase letters. There are other Unicode L categories, so you may want to tweak this to suit your requirements (perhaps unicodedata.category(char).startswith("L")); see also https://www.fileformat.info/info/unicode/category/index.htm
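
    To get a feel for the difference between the two checks, here is a small comparison. The helper names is_cased_letter and is_any_letter, and the sample characters, are chosen purely for illustration.

```python
import unicodedata

def is_cased_letter(char):
    """The check used in cleanup: lowercase or uppercase letters only."""
    return unicodedata.category(char) in ("Ll", "Lu")

def is_any_letter(char):
    """The broader alternative: any category starting with "L"."""
    return unicodedata.category(char).startswith("L")

# a, A, titlecase Dz-with-caron, modifier letter small h, Hebrew alef, digit, underscore
samples = ["a", "A", "\u01c5", "\u02b0", "\u05d0", "1", "_"]

for char in samples:
    print("U+%04X" % ord(char), unicodedata.category(char),
          is_cased_letter(char), is_any_letter(char))
```

    Letters from scripts without case (category "Lo") pass only the broader check, so which one you want depends on whether such text should be kept or replaced.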