Tags: c#, html, html-agility-pack

C# HtmlAgilityPack: remove <img> nodes


I am really new to HtmlAgilityPack. I have the following HTML document:

<a href="https://twitter.com/RedGiantNews" target="_blank"><img 
src="http://image.e.redgiant.com/lib/998.png" width="24" border="0" 
alt="Twitter" title="Twitter" class="smImage"></a><a 
href="https://www.facebook.com/RedGiantSoftware" target="_blank"><img 
src="http://image.e.redgiant.com/lib/db5.png" width="24" border="0" 
alt="Facebook" title="Facebook" class="smImage"></a>
http://click.e.redgiant.com/?qs=d2ad061f
<a href="https://www.instagram.com/redgiantnews/" target="_blank"><img 
src="http://image.e.redgiant.com/aa10-f8747e56f06d.png" width="24" 
border="0" alt="Instagram" title="Instagram" class="smImage"></a>

I am trying to remove all images, i.e. all <img ...> nodes (if that is the right word), from the HTML. I tried the code below, taken from another Stack Overflow answer, but in vain: it returns the same HTML as above:

var sb = new StringBuilder();
doc.LoadHtml(inputHTml);

foreach (var node in doc.DocumentNode.ChildNodes)
{
 if (node.Name != "img" && node.Name!="a")
  {
    sb.Append(node.InnerHtml);
  }
}

Solution

    static string OutputHtml = @"<a href=""https://twitter.com/RedGiantNews"" target=""_blank""><img 
                                        src=""http://image.e.redgiant.com/lib/998.png"" width=""24"" border=""0"" 
                                        alt=""Twitter"" title=""Twitter"" class=""smImage""></a><a
                                        href = ""https://www.facebook.com/RedGiantSoftware"" target=""_blank""><img
                                        src = ""http://image.e.redgiant.com/lib/db5.png"" width=""24"" border=""0"" 
                                        alt=""Facebook"" title=""Facebook"" class=""smImage""></a>
                                        <a href = ""https://www.instagram.com/redgiantnews/"" target=""_blank""><img
                                        src = ""http://image.e.redgiant.com/aa10-f8747e56f06d.png"" width=""24"" 
                                        border=""0"" alt=""Instagram"" title=""Instagram"" class=""smImage""></a>";
    

    The string above is the input used for testing (despite the variable name). I removed the floating link (http://click.e.redgiant.com/?qs=d2ad061f) from the original HTML string.

    Approach One:

    // Requires: using HtmlAgilityPack;
    public static string RemoveAllImageNodes(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so guard against it before iterating.
        var nodes = document.DocumentNode.SelectNodes("//img");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                node.Remove();
                //node.Attributes.Remove("src"); // This would remove only the src attribute from the <img> tag
            }
        }

        // No try/catch here: "catch (Exception ex) { throw ex; }" only resets the stack trace.
        return document.DocumentNode.OuterHtml;
    }
    

    Approach Two:

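    // Requires: using HtmlAgilityPack;
    public static string RemoveAllImageNodes(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Cheap textual pre-check. string.Contains is case-sensitive,
        // so this guard can miss markup written as "<IMG".
        if (document.DocumentNode.InnerHtml.Contains("<img"))
        {
            // The text "<img" may appear without a matching element (e.g. inside a comment),
            // in which case SelectNodes returns null.
            var nodes = document.DocumentNode.SelectNodes("//img");
            if (nodes != null)
            {
                foreach (var eachNode in nodes)
                {
                    eachNode.Remove();
                    //eachNode.Attributes.Remove("src"); // This would remove only the src attribute from the <img> tag
                }
            }
        }

        return document.DocumentNode.OuterHtml;
    }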
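
A variant of the two approaches above, as a minimal sketch: it uses the LINQ-style Descendants method instead of XPath, which yields an empty sequence rather than null when the document has no <img> elements. The class name ImageStripper is hypothetical and only there to make the snippet self-contained; it is not part of the original answer.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, introduced only for this sketch.
public static class ImageStripper
{
    public static string RemoveAllImageNodes(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Descendants("img") is evaluated lazily, so take a snapshot with ToList()
        // before removing nodes; otherwise the tree would change while it is being walked.
        foreach (var img in document.DocumentNode.Descendants("img").ToList())
        {
            img.Remove();
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

It would be called the same way as the methods above, e.g. ImageStripper.RemoveAllImageNodes(OutputHtml). Whether XPath or Descendants reads better is largely a matter of taste; the main practical difference is the null handling.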
    

    Output HTML:

    <a href="https://twitter.com/RedGiantNews" target="_blank"></a>
    <a href="https://www.facebook.com/RedGiantSoftware" target="_blank"></a>
    <a href="https://www.instagram.com/redgiantnews/" target="_blank"></a>
    

    Output HTML after removing only the "src" attribute from the <img> tags:

    <a href="https://twitter.com/RedGiantNews" target="_blank"><img width="24" border="0" alt="Twitter" title="Twitter" class="smImage"></a>
    <a href="https://www.facebook.com/RedGiantSoftware" target="_blank"><img width="24" border="0" alt="Facebook" title="Facebook" class="smImage"></a>
    <a href="https://www.instagram.com/redgiantnews/" target="_blank"><img width="24" border="0" alt="Instagram" title="Instagram" class="smImage"></a>
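
The attribute-only variant appears in the answer only as a commented-out line, so here is a sketch of it as a standalone method. The names ImageSrcStripper and RemoveImageSrcAttributes are hypothetical, and the XPath //img[@src] is an assumption that narrows the selection to images that actually carry a src attribute.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, introduced only for this sketch.
public static class ImageSrcStripper
{
    public static string RemoveImageSrcAttributes(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Select only <img> elements that have a src attribute.
        // SelectNodes returns null when nothing matches.
        var nodes = document.DocumentNode.SelectNodes("//img[@src]");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                // Drops the src attribute but keeps the <img> element and its other attributes.
                node.Attributes.Remove("src");
            }
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

This keeps the <img> elements in place (with width, alt, title, and class intact), so the page layout is preserved while the images themselves no longer load.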