java image html-parsing src text-extraction

Get src value of <img> tags with inconsistent quoting


I need a clever regex to match ... in these:

<img src="..."
<img src='...'
<img src=...

I want to match the inner content of src, but only if it is surrounded by matching double quotes, matching single quotes, or no quotes at all. This means that <img src=..." or <img src='... must not be accepted.

Any ideas on how to match these three cases with one regex?

So far I use something like this ("|'|[\s\S])(.*?)\1, and the part I want to get rid of is the hacky [\s\S], which I use to match the "missing symbol" at the beginning and the end of the ....


Solution

  • Wow, second one I'm answering today.

    Don't parse HTML with regex. Use an HTML/XML parser and your life will be much easier. Tidy will clean up your HTML for you, so you can run the HTML through Tidy first and then through a parser. Some Tidy-based libraries parse in addition to sanitizing, so you may not even have to run the result through another parser.

    Java, for example, has JTidy, and PHP has PHP Tidy.

    UPDATE

    Against my better judgement, I'm giving you this:

    /<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>/

    This works only for your specific case. Even so, it does not handle an escaped " or ' inside the image source, or a > character inside it, and it does not check that the opening and closing quotes match. There are probably a bunch of other limitations as well. The capturing group gives you the image source; for sources wrapped in single or double quotes it includes the quotes too, but you can strip those out.
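
Since the question is tagged java, here is a minimal sketch of how the three cases could be handled with java.util.regex. It is not the regex from the answer above: it swaps in a backreference so the closing quote must equal the opening one, and a lookahead so an unquoted value cannot end in a stray quote. The class name ImgSrcRegexSketch and the method extractSrc are made up for illustration. Like the pattern above, it assumes src is the first attribute and that the value contains no > character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ImgSrcRegexSketch {

    // Branch 1: (["']) captures the opening quote, the value is any run of
    //           characters that are neither that same quote nor '>', and \1
    //           requires the identical quote to close it.
    // Branch 2: an unquoted value with no whitespace, quotes, or '>', which
    //           must be followed by whitespace or '>' (so abc" is rejected).
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\s+src\\s*=\\s*(?:([\"'])((?:(?!\\1)[^>])*)\\1|([^\\s\"'>]+)(?=[\\s>]))",
            Pattern.CASE_INSENSITIVE);

    // Returns the src values of all <img> tags whose quoting is consistent.
    static List<String> extractSrc(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            // Group 2 is set for quoted values, group 3 for unquoted ones.
            sources.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\"> <img src='b.png'> <img src=c.png>"
                + " <img src=d.png\"> <img src='e.png>";
        // The last two tags have inconsistent quoting and are skipped.
        System.out.println(extractSrc(html)); // prints [a.png, b.png, c.png]
    }
}
```

The returned values never include the surrounding quotes, so nothing needs to be stripped afterwards. The warning above still applies: real-world HTML with other attributes before src, comments, or scripts will defeat this, and a parser is the safer choice.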