Search code examples
phpregexhref

PHP: Change HTML href Parser function to only match if it finds a static string in url


i am trying to modify some code that parses text for html hyperlinks and puts them into a database.

The change i'm trying to make is to only match if html hyperlink contains a certain text such as:

<a href="http://example.com/images/test1.jpg">my image</a>

would not be matched but

<a href="http://example.com/thisismyunique/string/test2.jpg">my image2</a>

would be matched based on that it has the "/thisismyunique/string" in the url.

Any ideas?

class blcHTMLLink extends blcParser {
    var $supported_formats = array('html');

  /**
   * Parse a string for HTML links - <a href="URL">anchor text</a>
   *
   * @param string $content The text to parse.
   * @param string $base_url The base URL to use for normalizing relative URLs. If ommitted, the blog's root URL will be used. 
   * @param string $default_link_text 
   * @return array An array of new blcLinkInstance objects. The objects will include info about the links found, but not about the corresponding container entity. 
   */
    function parse($content, $base_url = '', $default_link_text = ''){
        //remove all <code></code> blocks first
        $content = preg_replace('/<code[^>]*>.+?<\/code>/si', ' ', $content);

        //Find links
        $params = array(
            'base_url' => $base_url,
            'default_link_text' => $default_link_text,
        );
        $instances = $this->map($content, array($this, 'parser_callback'), $params);

        //The parser callback returns NULL when it finds an invalid link. Filter out those nulls
        //from the list of instances.
        $instances = array_filter($instances);

        return $instances;
    }

  /**
   * blcHTMLLink::parser_callback()
   *
   * @access private
   *
   * @param array $link
   * @param array $params
   * @return blcLinkInstance|null
   */
    function parser_callback($link, $params){
        global $blclog;
        $base_url = $params['base_url'];

        $url = $raw_url = $link['href'];
        $url = trim($url);
        //$blclog->debug(__CLASS__ .':' . __FUNCTION__ . ' Found a link, raw URL = "' . $raw_url . '"');

        //Sometimes links may contain shortcodes. Execute them.
        $url = do_shortcode($url);

        //Skip empty URLs
        if ( empty($url) ){
            $blclog->warn(__CLASS__ .':' . __FUNCTION__ . ' Skipping the link (empty URL)');
            return null;
        };

        //Attempt to parse the URL
        $parts = @parse_url($url);
        if(!$parts) {
            $blclog->warn(__CLASS__ .':' . __FUNCTION__ . ' Skipping the link (parse_url failed)', $url);
            return null; //Skip invalid URLs
        };

        if ( !isset($parts['scheme']) ){
            //No scheme - likely a relative URL. Turn it into an absolute one.
            //TODO: Also log the original URL and base URL.
            $url = $this->relative2absolute($url, $base_url); //$base_url comes from $params
            $blclog->info(__CLASS__ .':' . __FUNCTION__ . ' Convert relative URL to absolute. Absolute URL = "' . $url . '"');
        }

        //Skip invalid links (again)
        if ( !$url || (strlen($url)<6) ) {
            $blclog->info(__CLASS__ .':' . __FUNCTION__ . ' Skipping the link (invalid/short URL)', $url);
            return null;
        }

        //Remove left-to-right marks. See: https://en.wikipedia.org/wiki/Left-to-right_mark
        $ltrm = json_decode('"\u200E"');
        $url = str_replace($ltrm, '', $url);

        $text = $link['#link_text'];

        //The URL is okay, create and populate a new link instance.
        $instance = new blcLinkInstance();

        $instance->set_parser($this);
        $instance->raw_url = $raw_url;
        $instance->link_text = $text;

        $link_obj = new blcLink($url); //Creates or loads the link
        $instance->set_link($link_obj);

        return $instance;
    }

Solution

  • If you already have a href index in your $link parameter, which should contains the URL, you could easily do this:

    $blockedWord = '/thisismyunique/string';
    $blockedWordPosition = strpos($link['href'], $blockedWord);
    $hasBlockedWord = $blockedWordPosition !== false;
    

    Watch out, because strpos could return 0, if the needle has found in the beginning of the haystack string.

    Find out more here:

    http://php.net/manual/en/function.strpos.php