Tags: python, regex, python-3.x, character-class

Cleaning Text with python and re


I need to clean some text, as in the code below:

import re
def clean_text(text):
    text = text.lower()
    # replacement function
    text = re.sub(r"i'm","i am",text)
    text = re.sub(r"she's","she is",text)
    text = re.sub(r"can't","cannot",text)
    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)
    return text

clean_questions= []
for question in questions: 
    clean_questions.append(clean_text(question))

This code should give me a cleaned list of questions, but clean_questions came out empty. I reopened Spyder and the list was full but not cleaned; then I reopened it again and the list was empty. The console error says:

In [10]: clean_questions = []
    ...: for question in questions:
    ...:     clean_questions.append(clean_text(question))
Traceback (most recent call last):

  File "<ipython-input-6-d1c7ac95a43f>", line 3, in <module>
    clean_questions.append(clean_text(question))

  File "<ipython-input-5-8f5da8f003ac>", line 16, in clean_text
    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)

  File "C:\Users\hp\Anaconda3\lib\re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)

  File "C:\Users\hp\Anaconda3\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)

  File "C:\Users\hp\Anaconda3\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 580, in _parse
    raise source.error(msg, len(this) + 1 + len(that))

error: bad character range }-=

I am using Python 3.6, specifically the Anaconda build Anaconda3-2018.12-Windows-x86_64.


Solution

  • Your character class (as shown in the traceback) is invalid: } has a higher ordinal value than = (} is 125, = is 61), and the - between them makes the regex engine read }-= as a range running from }'s ordinal to ='s. Character ranges must go from low ordinal to high ordinal, so 125->61 is nonsensical, hence the error.

    In a way you got lucky; if the characters around the - had been reversed, e.g. =-}, you'd have silently removed all characters from ordinal 61 to 125 inclusive, which would have included, along with a mess of punctuation, all standard ASCII letters, both lower and uppercase.

    You could fix this by just removing the second - in your character class (you already included it at the beginning of the class where it doesn't need to be escaped), changing from

    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]", "", text)
    

    to

    text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", "", text)
    

    but I'd suggest dropping regular expressions here. The risk of mistakes with lots of literal punctuation is high, and there are methods that don't involve regex at all, work just fine, and don't leave you wondering whether you escaped everything important (the alternative, over-escaping, makes the regex unreadable and is still error-prone).

    Instead, replace that line with a simple str.translate call. First off, outside the function, make a translation table of the things to remove:

    # The redundant - is harmless here since the result is a dict which dedupes anyway.
    # Plain single-quoted string: in a raw string, \" would also put a backslash in the table.
    killpunctuation = str.maketrans('', '', '-()"#/@;:<>{}-=~|.?,')
    

    then replace the line:

    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)
    

    with:

    text = text.translate(killpunctuation)
    

    It should run at least as fast as the regex (likely faster), and it's far less error-prone, since no character has special meaning. Translation tables are just mappings from Unicode ordinals to None (meaning delete), to another ordinal (meaning single-character replacement), or to a string (meaning char -> multichar replacement); they have no concept of special escapes.

    If the goal is killing all ASCII punctuation, you're probably better off using the string module constant to define the translation table. That also makes the code more self-documenting, so people aren't wondering whether you are removing all or just some punctuation, and whether it was intentional:

    import string
    killpunctuation = str.maketrans('', '', string.punctuation)
    

    As it happens, your existing string does not remove all punctuation (it misses ^, !, and $, among others), so this change might not be what you want; if it is, definitely make it. If it's supposed to be a subset of punctuation, add a comment explaining how that punctuation was chosen, so maintainers don't wonder if you made a mistake.
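
To see the points above side by side, here is a minimal sketch rather than a drop-in replacement. The names BROKEN, FIXED, KILL_SUBSET, and KILL_ALL and the sample sentence are made up for illustration; only the two patterns come from the question. It checks that the original class fails to compile, that the reversed range =-} would quietly eat letters, and that the fixed regex and a str.translate table built from the same characters agree on one input.

```python
import re
import string

BROKEN = r"[-()\"#/@;:<>{}-=~|.?,]"  # pattern from the question
FIXED = r"[-()\"#/@;:<>{}=~|.?,]"    # same class with the second '-' removed

# '}' is ordinal 125 and '=' is 61, so the range }-= runs backwards
# and the pattern is rejected at compile time.
try:
    re.compile(BROKEN)
    broken_error = None
except re.error as exc:
    broken_error = str(exc)

# The reversed range =-} is valid and covers ordinals 61..125,
# which includes every ASCII letter.
letters_removed = re.sub(r"[=-}]", "", "Hello, World")

# Same characters as FIXED, written as a plain (non-raw) string so that
# no backslash ends up in the table.
KILL_SUBSET = str.maketrans("", "", '-()"#/@;:<>{}=~|.?,')
# Alternative if the intent is to drop all ASCII punctuation.
KILL_ALL = str.maketrans("", "", string.punctuation)

sample = "what's up? (really!) a-b: c=d"  # made-up input
via_regex = re.sub(FIXED, "", sample)
via_translate = sample.translate(KILL_SUBSET)
via_all = sample.translate(KILL_ALL)

print(broken_error)
print(repr(letters_removed))
print(via_regex)
print(via_all)
```

On this sample the subset table leaves ' and ! in place while string.punctuation strips them too, which is exactly the difference worth documenting in a comment.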