I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:
$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );
I'm pretty new to regex, but from what I've learned, ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?
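You need to match the characters in the middle of the URL, and the character class has to include : and . or it can never get past "http://". Also note that preg_match stops at the first match; use preg_match_all to collect every URL.
/\bhttp[\w%+\/:.-]+?pdf\b/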
\b matches a word boundary
^ and $ mark the beginning and end of the entire string. You don't want them here.
[...] matches any character in the brackets
\w matches any word character
+ matches one or more of the previous match
? makes the + lazy rather than greedy
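
Putting the pieces together, here is a minimal sketch of how this could look in PHP. The function name extract_pdf_urls and the sample HTML are made up for illustration, and the character class is assumed to include : and . so that "http://" and ".pdf" can match. It uses an inline string instead of file_get_contents so it runs without network access.

```php
<?php
// Hypothetical helper (not from the original post): returns every substring
// that starts with "http" and ends with "pdf".
function extract_pdf_urls(string $html): array
{
    // Assumption: ':' and '.' are added to the character class so that
    // "http://host/file.pdf" can match. The '-' is last, so it is a literal.
    $pattern = '/\bhttp[\w%+\/:.-]+?pdf\b/';

    // preg_match_all collects all matches; preg_match would stop at the first.
    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full-pattern matches.
    return $matches[0];
}

// Made-up sample markup standing in for the downloaded page.
$html = '<a href="http://www.example.com/docs/a.pdf">A</a>'
      . '<a href="http://www.example.com/page.html">B</a>'
      . '<a href="https://example.org/b%20c.pdf">C</a>';

print_r(extract_pdf_urls($html));
```

This is still a rough filter rather than a real URL parser: it will miss URLs containing characters outside the class (such as ? or =), and for anything beyond a quick scrape, walking the links with DOMDocument and checking each href is likely more robust.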