Tags: .net, regex, image, html-parsing, embedding

How do I find all image links in an HTML string?


I am trying to build a regex that parses an HTML file and finds all the image files it references. I need this so that I can embed the images before sending the HTML as an e-mail.

Is there a "list of places" where images can be referenced? For example, I know I need to look inside <img src="here" />, or in a CSS style url('here'), or background='here', but does that cover all cases?

And does the regex already exist somewhere? I find writing regexes painful, and I don't want to miss a case, or forget to handle some broken HTML markup.

For <img> tags, I found something like this:

(?<=img\s+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

but I don't know how to include other places.


Solution

  • Don't use a regex to parse HTML. Use an HTML parser such as HtmlAgilityPack instead:

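    // Requires "using System.Linq;" and the HtmlAgilityPack NuGet package.
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Collect the src of every <img>, skipping tags that have no src attribute
    // (indexing Attributes["src"] directly would throw on those).
    var imageUrls = doc.DocumentNode.Descendants("img")
                       .Select(x => x.GetAttributeValue("src", ""))
                       .Where(src => src.Length > 0)
                       .ToArray();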
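
The question also mentions the legacy background attribute and CSS url('here') references. Below is a minimal sketch of how the same parser-based approach might be extended to those places. It assumes the HtmlAgilityPack NuGet package is installed; the class name ImageUrlFinder and the method name Find are hypothetical, made up for this illustration. A small regex is still used, but only on CSS text that the parser has already isolated, not on the HTML itself. The list of places is not exhaustive: srcset attributes, <input type="image">, <video poster>, and SVG <image> elements, for example, are not handled here.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class ImageUrlFinder
{
    // Matches CSS url(...) with optional single or double quotes around the value.
    private static readonly Regex CssUrl = new Regex(
        @"url\(\s*['""]?(?<url>[^'"")]+?)['""]?\s*\)",
        RegexOptions.IgnoreCase);

    public static List<string> Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // <img src="...">
            if (node.Name == "img")
                urls.Add(node.GetAttributeValue("src", ""));

            // Legacy background="..." attribute (body, table, td, ...).
            urls.Add(node.GetAttributeValue("background", ""));

            // Inline style="..." attributes, plus the text of <style> elements.
            string css = node.GetAttributeValue("style", "");
            if (node.Name == "style")
                css += " " + node.InnerText;

            foreach (Match m in CssUrl.Matches(css))
                urls.Add(m.Groups["url"].Value);
        }

        // Drop the empty placeholders and any duplicates.
        return urls.Where(u => u.Length > 0).Distinct().ToList();
    }
}
```

Calling ImageUrlFinder.Find(html) returns the distinct URLs found in those places. Walking every node once keeps the sketch short; for very large documents, separate XPath queries per attribute may be a better fit. Stylesheets linked from the page are not fetched or scanned.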