I need to capture the value of an href attribute using a regex. For example, when I apply the pattern to href="www.google.com", I'd like to get www.google.com. I would also like to ignore every href whose value is just #.
Now, I was playing around for some time, and I came up with this:
href=(?:\"|\')((?:[^#]|.#.|.#|#.)+)(?:\"|\')
When I try it out in http://www.rubular.com/ it works like a charm, but I need to use it with preg_replace_callback in PHP, and there I don't get the expected result (for testing it in PHP, I was using this site: http://www.pagecolumn.com/tool/pregtest.htm).
What's my mistake here?
Since parsing HTML using regular expressions is a Bad Thing™, I suggest a less crude method:
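$dom = new DOMDocument;
// loadHTML() emits warnings on malformed markup; libxml_use_internal_errors(true) silences them
$dom->loadHTML($pageContent);
foreach ($dom->getElementsByTagName('a') as $item) {
    $href = $item->getAttribute('href');
    if ($href === '' || $href === '#') {
        continue; // skip anchors with no href and bare "#" placeholders
    }
    // $href now holds the attribute value, e.g. www.google.com
}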
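For something that can be pasted and run as-is, here is a minimal sketch of the same DOM approach wrapped in a function. The name extractHrefs and the $sample string are made up for illustration; the skip rule (empty values and a bare #) is my reading of what the question asks for, so adjust it if you need something different.

```php
<?php
// Hypothetical helper: returns every <a href> value except empty ones and a bare "#".
function extractHrefs(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '' && $href !== '#') {
            $hrefs[] = $href;
        }
    }

    return $hrefs;
}

// Illustrative input only: one normal link, one bare "#", one link with a fragment.
$sample = '<p><a href="www.google.com">Google</a> <a href="#">top</a> '
        . '<a href=\'page.html#intro\'>Intro</a></p>';

print_r(extractHrefs($sample));
```

With this sample, the bare # link is dropped while www.google.com and page.html#intro are kept. Single- and double-quoted attributes are handled by the parser, so there is no quote matching to get wrong.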