
Differentiate between two almost identical links in regex


I have created a plugin that turns links into the Facebook embedded version of the content at the link. My problem is that if I disable the comment part of the plugin, links to comments become embedded posts (as long as the post part of the plugin is still active).

Let's take a look. We have three links:

Facebook post

<a href="https://www.facebook.com/zuck/posts/10102577175875681" target="_blank">ONE</a>

Comment

<a href="https://www.facebook.com/zuck/posts/10102577175875681?comment_id=1193531464007751" target="_blank">Two</a>

Reply to a comment

<a href="https://www.facebook.com/zuck/posts/10102577175875681?comment_id=1193531464007751&reply_comment_id=10102577641662241" target="_blank">Three</a>

All three links begin with:

https://www.facebook.com/zuck/posts/10102577175875681

In the following code, the if conditions are my setting toggles, and $this->post['message'] holds whatever the user posted. In this example, it contains the three links above.

This is the plugin I have created for converting these links.

if ($this->registry->options['drcae_facebook_comment_onoff']) {
  // swaps facebook comment links to embed code
  $drc_embed_facebook_cmt = '<div class="fb-comment-embed" data-include-parent="true" data-width="560" data-href="https://www.facebook.com/$3/posts/$4comment_id=$5"></div>';
  $this->post['message'] = preg_replace('~<a (.*)href="(.*)facebook.com/(.*)/posts/(.*)?comment_id=(.*)"(.*)<\/a>~', $drc_embed_facebook_cmt, $this->post['message']);
}

if ($this->registry->options['drcae_facebook_post_onoff']) {
  // swaps facebook post links to embed code
  $drc_embed_facebook_post = '<div class="fb-post" data-href="https://www.facebook.com/$3/posts/$4"></div>';
  $this->post['message'] = preg_replace('~<a (.*)href="(.*)facebook.com/(.*)/posts/(.*)"(.*)<\/a>~', $drc_embed_facebook_post, $this->post['message']);
}

I originally had these the other way around (post first), but that caused comment links to be embedded as posts. I got around this by checking for comments first, which is probably not the best way to do it.

You may have noticed my regex. It's not the greatest, but it's what I was able to get working on my own, being new to regex altogether.

~<a (.*)href="(.*)facebook.com/(.*)/posts/(.*)"(.*)<\/a>~

I chose to write my regex this way so that a link would still embed even if its attributes were ordered differently, like the following:

<a target="blank" href="https://www.facebook.com/USERNAME/posts/1234567890" alt="facebook post">LINK</a>

But now I'm second-guessing my work, and after searching and not coming up with anything, I figured I would ask for some assistance.

How can I differentiate between these links so that posts don't interfere with comments or comment replies?

Update 1: embedded posts

Now my plugin looks like this:

$drc_embed_facebook_post = '<div class="fb-post" data-href="https://www.facebook.com/$2/posts/$3"></div>';
$this->post['message'] = preg_replace('~<a (.*?)facebook\.com/([^/]+)/[^/]+/([0-9]+)(?:[?][^0-9]+([0-9]+)(?:&(.+))?)?</a>~', $drc_embed_facebook_post, $this->post['message']);

The regex specifically:

~<a (.*?)facebook\.com/([^/]+)/[^/]+/([0-9]+)(?:[?][^0-9]+([0-9]+)(?:&(.+))?)?</a>~

I left the beginning as a lazy match-anything, (.*?), so that whatever comes before facebook.com (https://, www., etc.) is not restricted.

This only partially works for direct links to posts. Here are a few examples:

https://www.facebook.com/RyanNewMe/posts/616837631826216?pnref=story
https://www.facebook.com/zuck/posts/10102833246942211?pnref=story
https://www.facebook.com/zuck/posts/10102830259184701?pnref=story

These links do not embed the post. However, if I remove ?pnref=story from all of them, only the following link still fails:

https://www.facebook.com/RyanNewMe/posts/616837631826216

Solution

  • I created a nice, fast regex to extract the href earlier today, so I'm going to use that as a baseline:

    <a(?:\s*(?!href)[^\s>]*)*\s*href=["']([^"']+)
    

    If you use this regex, the value of the href attribute ends up in capture group 1. For example:

    https://www.facebook.com/zuck/posts/10102577175875681
    
    https://www.facebook.com/zuck/posts/10102577175875681?comment_id=1193531464007751
    
    https://www.facebook.com/zuck/posts/10102577175875681?comment_id=1193531464007751&reply_comment_id=10102577641662241
    

    Then you can parse this section.
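As a rough sketch of that first step in PHP: the pattern below is the href regex from above wrapped in ~ delimiters, and $message is a made-up sample (not taken from the plugin) with two anchors, one of which has its attributes in a different order.

```php
<?php
// Hypothetical sample input: two anchors, the second with href not in first position.
$message = '<a href="https://www.facebook.com/zuck/posts/10102577175875681" target="_blank">ONE</a> '
    . '<a target="blank" href="https://www.facebook.com/USERNAME/posts/1234567890" alt="facebook post">LINK</a>';

// The href regex from above, wrapped in ~ delimiters for PCRE.
$hrefPattern = '~<a(?:\s*(?!href)[^\s>]*)*\s*href=["\']([^"\']+)~';

// $matches[0] holds the full matches, $matches[1] the captured href values.
preg_match_all($hrefPattern, $message, $matches);
$hrefs = $matches[1];

print_r($hrefs);
```

Each entry of $hrefs can then be inspected on its own, without worrying about the rest of the anchor tag.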

    I made this regex which seems to work:

    facebook\.com/([^/]+)/[^/]+/([0-9]+)(?:[?][^0-9]+([0-9]+)(?:&(.+))?)?
    

    You should find your matches in $1, $2, $3, and $4 for "zuck", the original id, the comment id, and the entire rest of the link, respectively. (Yes, I got lazy at the end there. Do you need the end of the link parsed into pieces?)

    It looks really complex, but it's actually pretty understandable.

    • facebook\.com/ matches facebook.com/

    • [^/]+ matches one or more non-slash characters (the first occurrence is wrapped in parentheses to capture the username)

    • ([0-9]+) captures one or more numbers

    • This blob: (?:[?][^0-9]+([0-9]+)(?:&(.+))?)? specifies the optional extensions (that's the ending ?s).

      • The (?:) means non-capturing group (mostly to avoid shifting the numbering of $3 and $4).
      • [?][^0-9]+ means that there's a ? followed by some non digits.
      • ([0-9]+) captures digits
      • &(.+) matches an & and then captures the rest of the string.
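To make the group numbering concrete, here is a minimal sketch that runs the regex above against an href and labels the result. The function name drc_classify_facebook_href and the returned array keys are hypothetical, and the code assumes PHP 7.1 or later. It also assumes that a reply can be recognised by the leftover query string starting with reply_comment_id=, which is what the example links suggest.

```php
<?php
/**
 * Hypothetical helper: classify a Facebook href as 'post', 'comment' or 'reply'.
 * Returns null when the href does not look like a Facebook post link.
 */
function drc_classify_facebook_href(string $href): ?array
{
    // The regex from above, wrapped in ~ delimiters.
    $pattern = '~facebook\.com/([^/]+)/[^/]+/([0-9]+)(?:[?][^0-9]+([0-9]+)(?:&(.+))?)?~';

    if (!preg_match($pattern, $href, $m)) {
        return null;
    }

    // Optional groups that did not take part in the match are missing from $m.
    $commentId = $m[3] ?? '';
    $rest      = $m[4] ?? '';

    if ($commentId === '') {
        $type = 'post';
    } elseif (strpos($rest, 'reply_comment_id=') === 0) {
        $type = 'reply';
    } else {
        $type = 'comment';
    }

    return ['type' => $type, 'user' => $m[1], 'post_id' => $m[2], 'comment_id' => $commentId];
}

$result = drc_classify_facebook_href(
    'https://www.facebook.com/zuck/posts/10102577175875681?comment_id=1193531464007751'
);
echo $result['type'], "\n";
```

Note that [?][^0-9]+([0-9]+) treats the first number in any query string as the comment id, so a stricter pattern would be needed if other numeric parameters can show up.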

    Edit: Regarding your update, the regex can be fixed like this (unless I'm missing the problem):

    ~<a (.*?)facebook\.com/([^/]+)/[^/]+/([0-9]+)(?:[?][^0-9<]+([0-9]*)(?:&([^<]+))?)?</a>~
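As an alternative to one big pattern, the two steps from this answer (pull out the href, then inspect it) could be combined with preg_replace_callback so that the post and comment toggles no longer depend on the order of the checks. This is only a sketch under a few assumptions: the function name and its two boolean parameters are hypothetical stand-ins for the plugin options, the anchor pattern is a simplified one written for this example, the href pattern is narrowed to /posts/ and comment_id=, and a reply link is embedded as its parent comment. The embed markup is copied from the question.

```php
<?php
/**
 * Hypothetical sketch: replace Facebook post/comment anchors with embed markup.
 * $embedPosts and $embedComments stand in for the plugin's on/off options.
 */
function drc_embed_facebook_links(string $message, bool $embedPosts, bool $embedComments): string
{
    // Simplified anchor pattern (an assumption): captures the href value in group 1.
    $anchorPattern = '~<a\s[^>]*?href=["\']([^"\']+)["\'][^>]*>.*?</a>~is';
    // Group 3 (the comment id) only takes part in the match for comment and reply links.
    $hrefPattern = '~facebook\.com/([^/]+)/posts/([0-9]+)(?:[?]comment_id=([0-9]+))?~';

    $result = preg_replace_callback(
        $anchorPattern,
        function (array $a) use ($hrefPattern, $embedPosts, $embedComments) {
            if (!preg_match($hrefPattern, $a[1], $m)) {
                return $a[0]; // not a Facebook post link, leave untouched
            }
            $base = 'https://www.facebook.com/' . $m[1] . '/posts/' . $m[2];

            if (!empty($m[3])) {
                // Comment (or reply): never falls through to the post embed.
                if (!$embedComments) {
                    return $a[0];
                }
                $url = htmlspecialchars($base . '?comment_id=' . $m[3], ENT_QUOTES);
                return '<div class="fb-comment-embed" data-include-parent="true" data-width="560" data-href="' . $url . '"></div>';
            }

            if (!$embedPosts) {
                return $a[0];
            }
            return '<div class="fb-post" data-href="' . htmlspecialchars($base, ENT_QUOTES) . '"></div>';
        },
        $message
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $message;
}
```

With comments switched off, a comment link is left as a plain link instead of being picked up by the post branch, because the decision is made per link rather than by running two replacements in sequence.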