I am trying to build a regex for parsing an HTML file and getting all image files. I need to do this in order to embed images before sending it as an e-mail.
Is there a "list of places" where images can be referenced? For example, I know I need to look inside <img src="here" />, or in a CSS style url('here'), or background='here', but does that cover all cases?
And does the regex already exist somewhere? I find writing regexes painful, and I don't want to miss a case, or forget to handle some broken HTML markup.
For <img> tags, I found something like this:
(?<=img\s+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])
but I don't know how to include other places.
Don't use a regex to parse HTML. Use an HTML parser such as HtmlAgilityPack instead:
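using System.Linq;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Collect the src of every <img>; tags without a src attribute are skipped
// (indexing Attributes["src"] directly would throw on those).
var imageUrls = doc.DocumentNode.Descendants("img")
    .Select(x => x.GetAttributeValue("src", ""))
    .Where(src => src.Length > 0)
    .ToArray();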
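The other places mentioned in the question (the background attribute and CSS url(...) values) can be collected with the same parser. The following is only a minimal sketch, not a complete list of every place HTML can reference an image: the class name ImageReferences, the method Collect, and the small url(...) pattern are made up for illustration, and it assumes that InnerText of a <style> element returns its CSS text. A regex is still used here, but only on short CSS values, not on the HTML markup itself.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ImageReferences
{
    // Matches url(...) in CSS, with or without quotes around the address.
    private static readonly Regex CssUrl = new Regex(
        @"url\(\s*['""]?(?<url>[^'"")]+?)['""]?\s*\)",
        RegexOptions.IgnoreCase);

    public static IEnumerable<string> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType != HtmlNodeType.Element) continue;

            // <img src="...">
            if (node.Name == "img")
            {
                var src = node.GetAttributeValue("src", "");
                if (src.Length > 0) yield return src;
            }

            // Legacy background="..." attribute (body, table, td, ...)
            var background = node.GetAttributeValue("background", "");
            if (background.Length > 0) yield return background;

            // Inline style="... url(...) ..."
            var style = node.GetAttributeValue("style", "");
            foreach (Match m in CssUrl.Matches(style))
                yield return m.Groups["url"].Value;

            // <style> blocks (assumes InnerText yields the raw CSS)
            if (node.Name == "style")
                foreach (Match m in CssUrl.Matches(node.InnerText))
                    yield return m.Groups["url"].Value;
        }
    }
}
```

Calling ImageReferences.Collect(html).Distinct() (with System.Linq imported) gives each referenced address once. CSS in external stylesheets is not covered by this sketch, and url(...) values can also point at non-image resources such as fonts, so the results may need filtering before embedding.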