Tags: php, regex, preg-match

Regular expression starting with http and ending with pdf?


I have loaded the entire HTML of a page and want to retrieve all the URLs that start with http and end with pdf. I wrote the following, which didn't work:

$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );

I'm pretty new to regex, but from what I've learned, ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?


Solution

  • You need to match the characters in the middle of the URL:

    /\bhttp[\w%+\/:.-]+?pdf\b/
    
    • \b matches a word boundary

    • ^ and $ mark the beginning and end of the entire string. You don't want them here, because the URLs sit in the middle of the HTML.

    • [...] matches any single character listed in the brackets

    • \w matches any word character

    • + matches one or more repetitions of the preceding element (here, the bracketed class)

    • ? makes the + lazy rather than greedy
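
Since the question asks for all the URLs, here is a minimal sketch of how the pattern might be applied with preg_match_all(), which collects every match, whereas preg_match() stops at the first one. The $html string is hypothetical sample input standing in for the downloaded page, and $pdfUrls is just an illustrative name. The bracket class in this sketch includes ":" and ".", on the assumption that the URLs contain "http://" and dots in the host or file name.

```php
<?php
// Hypothetical sample input; in the question this comes from file_get_contents().
$html = '<a href="http://www.example.com/docs/report.pdf">Report</a>'
      . '<a href="https://www.example.com/a/b/guide-2.pdf">Guide</a>'
      . '<a href="http://www.example.com/index.html">Home</a>';

// "http", then lazily one or more URL characters, then "pdf" at a word boundary.
// Assumption: ":" and "." are allowed in the class so "http://" and dots can match.
$pattern = '/\bhttp[\w%+\/:.-]+?pdf\b/';

// preg_match_all() returns the number of full matches found.
$count = preg_match_all($pattern, $html, $matches);

// $matches[0] holds the text of each full-pattern match.
$pdfUrls = $matches[0];

print_r($pdfUrls);
```

With this sample input, the two .pdf links should be captured and the .html link skipped. For messy real-world HTML, an HTML parser such as DOMDocument is likely to be more robust than a regex for finding the links themselves.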