python regex punctuation

Remove punctuation from a string, specifically the three dots (ellipsis) at the end of a line, using Python


I have a Python script that cleans an uploaded dataframe. One of its functions removes punctuation from a string. It works on all punctuation except for the three dots (...), or ellipsis, at the end of a line.

In example 1 the function works as it should, while in example 2 it keeps the three dots at the end of the string.

======

    import re
    from string import punctuation

    # Raw string so the backslash is kept literally
    ppt = r'''...!@#$%^&*()....{}[]|._-`/?:;"'\,~12345678876543'''

    def processTweet(tweet):
        '''
        parameters:
        ====================
        - tweet: text of a single tweet (str)

        functions:
        ====================
        - Remove HTML special entities (e.g. &amp;)
        - Remove @username mentions
        - Remove tickers
        - convert to lowercase
        - Remove hyperlinks
        - Remove hashtags
        - Remove Punctuation and split 's, 't, 've with a space for filter
        
        '''
        # Remove HTML special entities (e.g. &amp;)
        tweet = re.sub(r'\&\w*;', '', tweet)
        # Remove @username mentions
        tweet = re.sub(r'@[^\s]+', '', tweet)
        # Remove tickers
        tweet = re.sub(r'\$\w*', '', tweet)
        # To lowercase
        tweet = tweet.lower()
        # Remove hyperlinks
        tweet = re.sub(r'https?:\/\/.*\/\w*', '', tweet)
        # Remove hashtags
        tweet = re.sub(r'#\w*', '', tweet)
        # Remove Punctuation and split 's, 't, 've with a space for filter
        # (re.escape stops the ']' inside ppt from closing the character class early)
        tweet = re.sub(r'[' + re.escape(ppt.replace('@', '')) + ']+', ' ', tweet)
        return tweet

Solution

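  • If the trailing character is a Unicode ellipsis (the single character "…", U+2026) rather than three separate periods, as pointed out by @JvdV, remove it explicitly: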

    tweet = tweet.replace('\u2026','')
    

    This removes every occurrence of the ellipsis character (U+2026) from the string.
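
To check which case applies, it can help to look at what the trailing characters actually are. The sketch below is only an illustration: `describe_tail` and `strip_unicode_punctuation` are hypothetical helper names, not part of the original script. The second helper assumes it is acceptable to replace every character in a Unicode punctuation category (`P*`) with a space, which covers U+2026 along with ASCII punctuation. Characters such as `$`, `^` and `~` are classed as symbols (`S*`), not punctuation, so they would still need the regex from the question.

```python
import unicodedata


def describe_tail(text, n=3):
    """Return the Unicode names of the last n characters of text."""
    return [unicodedata.name(ch, 'UNKNOWN') for ch in text[-n:]]


def strip_unicode_punctuation(text):
    """Replace every character in a Unicode punctuation category (P*) with a space."""
    return ''.join(
        ' ' if unicodedata.category(ch).startswith('P') else ch
        for ch in text
    )


# Three ASCII periods and the single ellipsis character look alike but differ:
print(describe_tail('wait...', 1))       # ['FULL STOP']
print(describe_tail('wait\u2026', 1))    # ['HORIZONTAL ELLIPSIS']

# Both forms are replaced, one space per punctuation character:
print(repr(strip_unicode_punctuation('so good\u2026')))  # 'so good '
```

If `describe_tail` reports `HORIZONTAL ELLIPSIS`, the ASCII-only character class in `ppt` cannot match it, which would explain why the dots survive.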