I am trying to extract all the text from the tweets before the URL starting with "https:...".
Example Tweet:
"This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
In this example I would like to remove the "https://... (Video via @QuickTake)" part and keep the text before it. It should also work when the tweet text contains no URL at all.
I have tried this expression, but it returns two matches when the tweet contains a URL:
/(.*)(?=\shttps.*)|(.*)
How can I make it retrieve only the text from the tweets?
Thanks in advance!
You may remove the https and all that follows till the end of the string. Use
tweet = re.sub(r'\s*https.*', '', tweet)
Details:
\s* - 0+ whitespaces
https - a literal string
.* - the rest of the string (line).
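To see the substitution applied to both cases from the question, here is a minimal sketch. The helper name strip_url_tail and the second sample tweet are made up for illustration; only the re.sub call comes from the answer itself.

```python
import re

def strip_url_tail(tweet: str) -> str:
    # Hypothetical helper: drop optional whitespace, the literal "https",
    # and everything after it up to the end of the line.
    return re.sub(r'\s*https.*', '', tweet)

# Tweet from the question (the URL is elided there, so it stays elided here).
with_url = ("This traditional hairdo is back in fashion thanks to the coronavirus, "
            "and Kenyans are using it to raise awareness https://... (Video via @QuickTake)")
# Made-up tweet without any link, to check the no-URL case.
without_url = "Just a plain tweet with no link"

print(strip_url_tail(with_url))     # text before the URL only
print(strip_url_tail(without_url))  # unchanged, since nothing matches
```

When there is no "https" in the tweet, the pattern never matches and re.sub returns the string unchanged, so no separate branch is needed. Note that . does not match a newline by default, so in a multi-line tweet only the rest of the line containing the URL is removed; passing flags=re.DOTALL would remove everything to the end of the string instead, if that is what you want.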