How to remove everything from a tweet but the plain text with PHP?


I'm trying to strip URLs, mentions, and hashtags from a tweet so that only the actual text remains. So instead of:

Hello this is a test @someone #tag1 #tag2 http://bit.ly/123

it'd be just:

Hello this is a test

I believe I'd have to use some sort of regular expression, but I'm terrible at them. Could someone point me in the right direction?

Thanks in advance.


Solution

  • Here's how to do it with three regular expressions (you could probably merge all three into one, but let's not go there!):

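    // \b cannot match between a space and "@" or "#" (both are non-word characters),
    // so use a negative lookbehind; this also leaves addresses like foo@bar.com alone
    $str = preg_replace('/(?<!\w)@\w+/', '', $str); // remove @someone
    $str = preg_replace('/(?<!\w)#\w+/', '', $str); // remove hashtags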
    
    // taken from http://daringfireball.net/2010/07/improved_regex_for_matching_urls
    $urlRegex = '~(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))~';
    $str = preg_replace($urlRegex, '', $str); // remove urls
    $str = trim(preg_replace('/\s+/', ' ', $str)); // collapse the whitespace left behind
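
If it helps, the same idea can be wrapped in one small function. This is only a sketch under a few assumptions: the name tweetToPlainText is made up for illustration, the URL pattern is a deliberately simplified stand-in for the Daring Fireball regex (it only catches http(s):// links and bare www. hosts), and mentions and hashtags are assumed to consist of ASCII word characters. URLs are removed first so that a fragment such as #top inside a link disappears together with the link.

```php
<?php
// Hypothetical helper, not part of the original answer.
function tweetToPlainText(string $tweet): string
{
    // 1. Drop URLs first (simplified pattern: http(s):// links or hosts starting with "www.").
    $text = preg_replace('~\b(?:https?://|www\.)\S+~i', '', $tweet);

    // 2. Drop @mentions and #hashtags. The lookbehind requires that the "@" or "#"
    //    is not preceded by a word character, so "foo@bar.com" is left alone.
    $text = preg_replace('/(?<!\w)[@#]\w+/', '', $text);

    // 3. Collapse the runs of whitespace left behind and trim both ends.
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

echo tweetToPlainText('Hello this is a test @someone #tag1 #tag2 http://bit.ly/123'), PHP_EOL;
// prints "Hello this is a test"
```

For anything beyond quick cleanup, swapping the simplified URL pattern back for the longer regex above should catch more link shapes, at the cost of readability.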