python, regex, twitter, sentiment-analysis

Python regex removes URLs, but when the URL is at the beginning of a tweet the whole sentence is removed as well


I need to remove any URLs from the tweets I am reviewing. How can I remove only the URL when it appears at the beginning of a tweet?

I've tried some code, and this Python regex successfully removes URLs, but if the URL is at the beginning of the tweet, the whole sentence is removed as well.

re.sub(r'https?:\/\/.*[\r\n]*\S+', '', verbatim, flags = re.MULTILINE)



Solution

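  • The pattern https?:\/\/.*[\r\n]*\S+ first matches http, an optional s, and ://

    The .* part then matches to the end of the line (by default the dot does not match a newline), [\r\n]* matches 0+ newline characters, and \S+ matches 1+ non-whitespace chars.

    So the match covers the URL, the rest of the line, any newlines that follow, and the first run of non-whitespace chars on the next line. If there is no next line, .* backtracks by one character so that \S+ can still match, and the rest of the tweet is removed along with the URL.

    Since \S+ alone already matches the URL up to the next whitespace, you could shorten the pattern to: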

    \bhttps?://\S+
    

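
To see the difference, here is a minimal sketch that runs both patterns on a made-up tweet. The strip_urls helper, the sample text, and the whitespace cleanup step are assumptions added for illustration; they are not part of the question or the original answer.

```python
import re

# Shortened pattern from the answer: "http://" or "https://" followed by
# a run of non-whitespace characters.
URL_PATTERN = re.compile(r"\bhttps?://\S+")


def strip_urls(text):
    """Hypothetical helper: remove URLs and tidy the leftover spaces."""
    without_urls = URL_PATTERN.sub("", text)
    # Collapse runs of spaces/tabs left behind where a URL was removed,
    # then trim the ends (this cleanup is an assumption, not in the answer).
    return re.sub(r"[ \t]{2,}", " ", without_urls).strip()


# Made-up sample tweet with the URL at the start.
tweet = "https://example.com/abc great phone, love the camera"

# Pattern from the question: .* runs to the end of the line, so the
# text after the URL is removed too.
original = re.sub(r"https?:\/\/.*[\r\n]*\S+", "", tweet, flags=re.MULTILINE)

print(repr(original))           # ''
print(repr(strip_urls(tweet)))  # 'great phone, love the camera'
```

A URL in the middle of a tweet is handled the same way: strip_urls("love it https://t.co/xyz so much") returns "love it so much".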