python, python-re

Making a long chain of replaces faster


I'm filtering lots of tweets, and while testing how to filter each character I ended up with this:

import re

# `string` is the path to the tweets file; remove_emoji() is a helper defined elsewhere (not shown)
with open(string, encoding='utf-8') as x:
    text = x.read()
text = re.sub(r'http\S+\n', '', text)
text = re.sub(r'http\S+', '', text)  # removes links
text = re.sub(r'@\S+\n', '', text)
text = re.sub(r'@\S+', '', text)  # removes usernames
text = text.replace('0', '').replace('1', '').replace('2', '').replace('3', '') \
    .replace('4', '').replace('5', '').replace('6', '').replace('7', '').replace('8', '').replace('9', '') \
    .replace(',', '').replace('"', '').replace('“', '').replace('?', '').replace('¿', '').replace(':', '') \
    .replace(';', '').replace('-', '').replace('!', '').replace('¡', '').replace('.', '').replace('ℹ', '') \
    .replace('\'', '').replace('[', '').replace(']', '').replace('   ', '').replace('  ', '').replace('”', '') \
    .replace('º', '').replace('+', '').replace('#', '').replace('\n', '').replace('·', '\n')
text = remove_emoji(text).lower()

This was useful because I could test many things, but I don't think I'm going to modify it anymore, so it's ready to be optimized. How could I make it faster? Every replace substitutes an empty string except .replace('·', '\n').


Solution

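  • You can achieve most of this with the str.maketrans and str.translate methods: they let you map any single character to any string, or delete it by mapping it to an empty string or None. Multi-character patterns, such as the runs of spaces, still need replace or re.sub.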

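    s = "asd123.?fgh"

    # Map each single character to its replacement; "" deletes the character.
    translations = {"1": "", "2": "", "3": "", ".": "\n", "?": ""}
    print(s.translate(str.maketrans(translations)))  # prints "asd", then "fgh" on the next line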

    translate applies all the changes in a single pass through the string instead of one pass per replace call, which should make it considerably faster on long texts.
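
Applied to the chain in the question, this could look roughly like the sketch below. The names DELETE, TABLE and clean are made up for illustration, and the character list is copied by hand from the question, so check it against your own. One caveat: the original chain collapses the runs of spaces before it removes ”, º, +, # and the newline, while this sketch removes every single character first, so the output can differ in a few edge cases around spaces.

```python
# Characters the chain in the question deletes outright (copied by hand from the question).
DELETE = '0123456789,"“?¿:;-!¡.ℹ\'[]”º+#\n'

# str.maketrans(x, y, z): x[i] is mapped to y[i], and every character in z is deleted.
TABLE = str.maketrans('·', '\n', DELETE)


def clean(text: str) -> str:
    """Hypothetical helper: all single-character replacements in one pass."""
    text = text.translate(TABLE)
    # translate only works on single characters, so the multi-space
    # replacements from the question still go through str.replace.
    return text.replace('   ', '').replace('  ', '')


print(repr(clean("Hi, you? 123·Bye!")))  # 'Hi you \nBye'
```

Building the table once at module level means a loop over many tweets only pays for the translate call itself. The re.sub calls for links and usernames, and remove_emoji, would stay as they are before and after this step.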