Tags: php, html, html-parsing, src, text-extraction

Get src value containing a specific keyword from all <img> tags


I'm trying to match src="URL" attributes like the following:

src="http://3.bp.blogspot.com/-ulEY6FtwbtU/Twye18FlT4I/AAAAAAAAAEE/CHuAAgfQU2Q/s320/DSC_0045.JPG"

Basically, anything that has some sort of bp.blogspot URL inside the src attribute. I have the following, but it's only partially working:

preg_match('/src=\"(.*)blogspot(.*)\"/', $content, $matches);

Solution

  • This one accepts all blogspot URLs and allows escaped quotes inside the attribute value:

    src="((?:[^"]|(?:(?<!\\)(?:\\\\)*\\"))+\bblogspot\.com/(?:[^"]|(?:(?<!\\)(?:\\\\)*\\"))+)"
    

    The URL gets captured to match group 1.

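    To use it in preg_match(…), you will need to escape each \ with an additional \ (for every occurrence!) so it survives the PHP string literal, and escape each / as well if / is your pattern delimiter.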

    Explanation:

    src=" # needle 1
    ( # start of capture group 1
        (?: # start of non-capturing group
            [^"] # any non-quote character
            | # or:
            (?:(?<!\\)(?:\\\\)*\\") # a backslash-escaped quote
        )+ # end of non-capturing group, repeated one or more times
        \b # word boundary (start of the word "blogspot")
        blogspot\.com/ # needle 2
        (?: # start of non-capturing group
            [^"] # any non-quote character
            | # or:
            (?:(?<!\\)(?:\\\\)*\\") # a backslash-escaped quote
        )+ # end of non-capturing group, repeated one or more times
    ) # end of capture group 1
    " # needle 3
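
As a rough sketch of how the pattern above might be applied from PHP: the snippet below wraps it in ~ delimiters and stores it in a nowdoc, which is one way to avoid the extra escaping of \ and / altogether. The sample markup in $content and the second image URL are made up for illustration, and preg_match_all is used instead of preg_match on the assumption that you want every matching <img> rather than only the first.

```php
<?php
// The pattern from the answer, unchanged. Using "~" as the delimiter means
// "/" needs no escaping, and a nowdoc keeps every backslash literal, so the
// regex can be pasted in exactly as written.
$pattern = <<<'REGEX'
~src="((?:[^"]|(?:(?<!\\)(?:\\\\)*\\"))+\bblogspot\.com/(?:[^"]|(?:(?<!\\)(?:\\\\)*\\"))+)"~
REGEX;

// Hypothetical sample input: one blogspot-hosted image and one that is not.
$content = '<img src="http://3.bp.blogspot.com/-ulEY6FtwbtU/Twye18FlT4I/AAAAAAAAAEE/CHuAAgfQU2Q/s320/DSC_0045.JPG">'
         . '<img src="http://example.com/logo.png">';

// Collect every match; capture group 1 holds the URL without the quotes.
preg_match_all($pattern, $content, $matches);
$urls = $matches[1];

print_r($urls);
```

With this input, $urls should contain only the blogspot URL. If the markup is not under your control (single-quoted attributes, src on other elements, and so on), an HTML parser is likely to be more robust than a regex.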