Tags: php, regex, url, href, quoting

Extract single- and double-quoted URLs from links using regex


I need to extract Twitter ids in a PHP script using regex. It works great as long as the URLs are written with double quotes, but fails with single quotes.

<a href='http://www.twitter.com/singlequotes'>Twitter Single Quotes</a>
<a href="http://www.twitter.com/doublequotes">Twitter Double Quotes</a>

// regular expression
/<a [^>]*\bhref\s*=\s*"\K[^"]*twitter.com[^"]*/

I have tried "|', ["'] and many other things, but none of them work. I would be very thankful for any help. Thanks!

https://regex101.com/r/7Zu3uF/1


Solution

  • This pattern is about as efficient as it gets. No capture group is needed.

    href=['"]\K[^'"]+

    Look for a single or double quote after href=, then match everything up to the next single or double quote. The \K discards the text matched so far, so the overall match is just the URL. That is as simple as it can be made.

    P.S. If there may be spaces around the =, use:

    href *= *['"]\K[^'"]+

    PHP implementation:

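    // Sample input: single- and double-quoted hrefs, with and without "www."
    $in='<a href=\'http://www.twitter.com/singlequotes\'>Twitter Single Quotes</a>
    <a href="http://www.facebook.com/doublequotes">Facebook Double Quotes</a>
    <a href=\'http://twitter.com/singlequotes\'>Twitter Single Quotes</a>
    <a href="https://www.facebook.com/doublequotes">Facebook Double Quotes</a>';
    
    // Domain names (without ".com") whose links should be matched
    $companies=['twitter','facebook'];
    
    // \K restarts the match after the opening quote, so $matches[0] holds only the URLs
    $pattern='/href *= *[\'"]\Khttps?:\/\/(?:www\.)?(?:'.implode('|',$companies).')\.com[^\'"]+/';
    $out=preg_match_all($pattern,$in,$matches)?$matches[0]:[];
    
    var_export($out);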
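
The question asks for Twitter ids rather than whole URLs, so the same idea can be pushed one step further by moving \K past the domain. The following is only a sketch under some assumptions: the id is taken to be the first path segment after twitter.com, the helper name extractTwitterIds is hypothetical, and the third sample link (no "www.", with a query string) was added for illustration.

```php
<?php
// Sample markup from the question, plus one illustrative link without "www.".
$html = <<<'HTML'
<a href='http://www.twitter.com/singlequotes'>Twitter Single Quotes</a>
<a href="http://www.twitter.com/doublequotes">Twitter Double Quotes</a>
<a href="https://twitter.com/no_www?lang=en">No www</a>
HTML;

// Hypothetical helper (not part of the original answer): returns the first
// path segment after twitter.com for every href, whichever quote style is used.
function extractTwitterIds(string $html): array
{
    // \K discards everything matched so far, so the full match is just the id.
    // The final character class stops at a closing quote, a slash, a query
    // string (?) or a fragment (#).
    $pattern = '~href\s*=\s*[\'"]https?://(?:www\.)?twitter\.com/\K[^\'"/?#]+~i';

    return preg_match_all($pattern, $html, $matches) ? $matches[0] : [];
}

var_export(extractTwitterIds($html));
```

Using ~ as the delimiter avoids escaping the slashes in the URL. For markup that is not under your control, parsing with DOMDocument and then applying a regex to each href value is likely to be more robust than matching the raw HTML.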