
How to choose first match from Alternation regex?


I am trying to extract all the text from the tweets before the URL starting with "https:...".

Example Tweet:

"This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"

In this example I would like to remove "https://... (Video via @QuickTake)" and keep the text from the beginning. It should also work when the tweet text contains no URL.

I have tried this expression, but it gives two matches when the tweet contains a URL:

/(.*)(?=\shttps.*)|(.*)

How can I make it retrieve only the text of the tweet?

Thanks in advance!


Solution

  • You can remove https and everything that follows it up to the end of the string with:

    tweet = re.sub(r'\s*https.*', '', tweet)
    

    Details:

    • \s* - zero or more whitespace characters
    • https - the literal string https
    • .* - the rest of the line (by default, . does not match line breaks).
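
    As a quick check, here is a minimal sketch that applies the substitution to the example tweet and to a tweet without a link. The helper name strip_url_tail and the second sample tweet are made up for illustration; only re.sub and the pattern come from the answer above.

```python
import re


def strip_url_tail(tweet: str) -> str:
    """Drop the first 'https' (plus preceding whitespace) and the rest of that line.

    Hypothetical helper name; if the pattern does not match, the tweet
    is returned unchanged.
    """
    return re.sub(r'\s*https.*', '', tweet)


with_url = (
    "This traditional hairdo is back in fashion thanks to the coronavirus, "
    "and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
)
# Made-up sample with no link, to show the no-match case.
without_url = "A tweet with no link in it"

print(strip_url_tail(with_url))
print(strip_url_tail(without_url))
```

    Because re.sub leaves the string untouched when nothing matches, no alternation is needed for tweets without a URL. If a tweet can span several lines and everything after the link should go, passing flags=re.DOTALL to re.sub lets .* run across line breaks as well.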