I'm using this regex, which converts Twitter post links into embedded tweets:
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://(?:www:)?twitter\.com/([^&]+)/status/([^&]+)\S*~i'
But when I try to do the same for Instagram or Facebook, it doesn't work:
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://(?:www:)?instagram\.com/p/([^&]+)\S*~i'
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://(?:www:)?facebook\.com/([^&]+)/posts/([^&]+)\S*~i'
The regexes are almost identical, and Twitter links look almost the same as Facebook links, e.g.
https://twitter.com/USER/status/idnumber
https://www.facebook.com/USER/posts/idnumber
Instagram links are similar, but look like this: https://www.instagram.com/p/id
The reason I'm using ~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)| at the start of the regex is that I have BBCode on my site; see my previous question about this regex.
EDIT:
Here is the complete regex and replacement:
$search = array (
'~\[b](.*?)\[/b]~is',
'~\[i](.*?)\[/i]~is',
'~\[u](.*?)\[/u]~is',
'~\[ul](.*?)\[/ul]~is',
'~\[li](.*?)\[/li]~is',
'~\[user=(.*?)](.*?)\[/user]~i',
'~\[url=https?.*?(?:[/?&](?:e|vi?|ci)(?:[/=]|%3D)|youtu\.be/|embed/|/user/[^/]+#p/(?:[^/]+/)+)([\w-]{10,12})].*?\[/url]~i',
'~\[url]https?.*?(?:[/?&](?:e|vi?|ci)(?:[/=]|%3D)|youtu\.be/|embed/|/user/[^/]+#p/(?:[^/]+/)+)([\w-]{10,12}).*?\[/url]~i',
'~\[url=((?:ht|f)tps?://[a-z\d.-]+\.[a-z]{2,3}/\S*?)](.*?)\[/url]~i',
'~\[url]((?:ht|f)tps?://[a-z\d.-]+\.[a-z]{2,3}/\S*?)\[/url]~i',
'~\[img=(.*?)].*?\[/img]~i',
'~\[quote](.*?)\[/quote]~is',
'~\[code](.*?)\[/code]~is',
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|(?:\bhttps?.*?(?:[/?&](?:e|vi?|ci)(?:[/=]|%3D)|youtu\.be/|embed/|/user/[^/]+#p/(?:[^/]+/)+)([\w-]{10,12}))\S*~i',
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://(?:www:)?clips\.twitch\.tv/([^&]+)\S*~i',
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://(?:www:)?imgur\.com/gallery/([^&]+)\S*~i',
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://(?:www:)?twitter\.com/([^&]+)/status/([^&]+)\S*~i',
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://(?:www:)?facebook\.com/([^&]+)/posts/([^&]+)\S*~i',
'~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)|\bhttps?://.+?(?=\s|$)~im'
);
$replace = array (
'<strong>$1</strong>',
'<em>$1</em>',
'<u>$1</u>',
'<ul>$1</ul>',
'<li>$1</li>',
'<a href="../login/profile?u=$1" target="_blank">$2</a>',
'<br><iframe width="600" height="315" src="//www.youtube.com/embed/$1" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><br>',
'<br><iframe width="600" height="315" src="//www.youtube.com/embed/$1" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><br>',
'<a href="$1" target="_blank" rel="nofollow">$2</a>',
'<a href="$1" target="_blank" rel="nofollow">$1</a>',
'<img src="$1"></img>',
'<quote>$1</quote>',
'<code>$1</code>',
'<br><iframe width="600" height="315" src="//www.youtube.com/embed/$1" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><br>',
'<br><iframe width="600" height="315" src="//clips.twitch.tv/embed?clip=$1&autoplay=false" frameborder="0" allowfullscreen></iframe><br>',
'<blockquote class="imgur-embed-pub" lang="en" data-id="$1"><a href="//www.imgur.com/$1"></a></blockquote><script async src="//s.imgur.com/min/embed.js" charset="utf-8"></script>',
'<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="//twitter.com/$1/status/$2?ref_src=twsrc%5Etfw"></a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>',
'<iframe src="//www.facebook.com/plugins/post.php?href=//www.facebook.com/$1/posts/$2&width=500" width="500" height="705" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true" allow="encrypted-media"></iframe>',
'<a href="$0" target="_blank" rel="nofollow">$0</a>'
);
Without seeing some sample URLs that you are trying to extract from, it is hard to say with absolute certainty, but perhaps I can offer some general advice.
([^&]+) <-- This captures one or more non-ampersand characters. The "greedy" quantifier (+) will match and match and match whitespace and visible characters, across multiple lines, until it finds the next & or the end of the string (backtracking only as far as the rest of the pattern requires)! ...clearly not what you want.
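To illustrate the greedy behaviour described above, here is a minimal sketch. The sample text is hypothetical (a tweet link followed by unrelated words on the next line), and the pattern is a shortened form of the Twitter regex from the question, without the (*SKIP)(*FAIL) prefix.

```php
<?php
// Hypothetical sample: a tweet URL followed by unrelated text on the next line.
$text = "https://twitter.com/USER/status/12345\nsome unrelated text";

// Shortened form of the question's pattern, keeping the ([^&]+) groups.
$pattern = '~\bhttps?://twitter\.com/([^&]+)/status/([^&]+)~i';

preg_match($pattern, $text, $m);

// Group 1 backtracks to "USER" so that "/status/" can match,
// but group 2 has nothing after it to stop it, so it runs past the newline.
var_dump($m[1]); // string(4) "USER"
var_dump($m[2]); // the id, the newline, and "some unrelated text"
```

Because the sample contains no ampersand, the second group only stops at the end of the string, which is more than the tweet id.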
If you want to make sure that there are no &, ?, or # characters, you can use ([^&?#]+). However, this too may consume too much: if the URL doesn't contain any of those characters, the regex engine will keep matching past the end of the URL.
If you are uncertain about the characters that will exist, but you know that they will be "visible" characters, you can use \S+.
Finally, you can add the white-space characters to your "negated character class" like this: ([^&?#\s]+)
By using this last one, you can follow it immediately with \S*, which will match/consume zero or more trailing visible characters. This ensures that the whole URL is replaced and that you capture only the "white meat" that you are seeking.
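As a rough sketch of that last suggestion applied to the Facebook pattern from the question: the function name embed_facebook and the bracketed replacement string are hypothetical placeholders, not the real embed markup. One further assumption of mine, which the advice above does not cover: the sample Facebook and Instagram URLs start with "www.", while (?:www:)? only matches a literal "www:", so the sketch uses (?:www\.)? instead.

```php
<?php
// Hypothetical helper: swaps Facebook post URLs for a placeholder tag.
function embed_facebook(string $text): string
{
    // Same (*SKIP)(*FAIL) prefix as in the question, so URLs already
    // inside <a>, <img> or <iframe> markup are left untouched.
    // Assumption: (?:www\.)? replaces the question's (?:www:)?.
    $pattern = '~(?:<a.*?</a>|<img.*?</img>|<iframe.*?</iframe>)(*SKIP)(*FAIL)'
             . '|\bhttps?://(?:www\.)?facebook\.com/([^&?#\s]+)/posts/([^&?#\s]+)\S*~i';

    // Placeholder output; substitute the real iframe markup here.
    return preg_replace($pattern, '[facebook user=$1 post=$2]', $text);
}

// The trailing "?ref=share" is consumed by \S* but kept out of group 2.
echo embed_facebook("Look: https://www.facebook.com/USER/posts/12345?ref=share and more text"), "\n";
// Look: [facebook user=USER post=12345] and more text

// A URL already wrapped in an anchor is skipped.
echo embed_facebook('<a href="https://www.facebook.com/USER/posts/1">x</a>'), "\n";
// <a href="https://www.facebook.com/USER/posts/1">x</a>
```

The same negated class, ([^&?#\s]+), could be dropped into the Instagram and Twitter patterns in the same way, though I have only checked the Facebook shape shown here.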