Getting content of href value


I need to capture the value of an href attribute using a regex. For example, when I apply the pattern to href="www.google.com", I'd like to get www.google.com. I would also like to ignore every href whose value is only #.

Now, I was playing around for some time, and I came up with this:

href=(?:\"|\')((?:[^#]|.#.|.#|#.)+)(?:\"|\')

When I try it out in http://www.rubular.com/ it works like a charm, but I need to use it with preg_replace_callback in PHP, and there I don't get the expected result (for testing it in PHP, I was using this site: http://www.pagecolumn.com/tool/pregtest.htm).

What's my mistake here?


Solution

  • Since parsing HTML using regular expressions is a Bad Thing™, I suggest a less crude method:

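    $dom = new DOMDocument();
    // Suppress libxml warnings triggered by malformed real-world HTML
    libxml_use_internal_errors(true);
    $dom->loadHTML($pageContent);
    libxml_clear_errors();

    // Walk every <a> element in the document
    foreach ($dom->getElementsByTagName('a') as $item) {
        $href = $item->getAttribute('href');
        if ($href === '' || $href === '#') {
            continue; // skip missing hrefs and bare "#" placeholders
        }
        // $href now holds the attribute value, e.g. www.google.com
    }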
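
Since the question mentions preg_replace_callback, the same DOM approach can probably cover the rewriting case too. The following is only a rough sketch under a few assumptions: rewriteHrefs is a hypothetical helper name (not part of PHP or of the original answer), the callback receives the raw href string and returns its replacement, and a value of exactly # is left untouched. Note that saveHTML() returns a whole document, so a fragment comes back wrapped in a doctype and html/body tags.

```php
<?php
// Hypothetical helper (an assumption for illustration): applies $callback
// to the href of every <a> element, leaving bare "#" values untouched.
function rewriteHrefs(string $html, callable $callback): string
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings quietly
    // and restore the previous error-handling setting afterwards.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if (!$anchor->hasAttribute('href')) {
            continue; // anchors without an href have nothing to rewrite
        }
        $href = $anchor->getAttribute('href');
        if ($href === '#') {
            continue; // ignore placeholder links, as the question asks
        }
        $anchor->setAttribute('href', $callback($href));
    }

    // saveHTML() serializes the whole document, including the wrapper
    // tags that loadHTML() added around a fragment.
    return (string) $dom->saveHTML();
}

// Example usage: prefix every real link with a scheme.
$html = '<a href="www.google.com">Google</a> <a href="#">Top</a>';
echo rewriteHrefs($html, function (string $href): string {
    return 'http://' . $href;
});
```

The callback plays the role that the preg_replace_callback callback would have played, but it only ever sees attribute values the parser has already extracted, so quoting style (single or double) and attribute order no longer matter.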