
HTML Agility Pack - Removing a hyperlink that is the parent of an image


To keep the example simple, the code below loads HTML with HAP (HTML Agility Pack), finds every img element with a non-empty src, and replaces that src with a numbered local path.

HtmlDocument document = new HtmlDocument();
document.LoadHtml(htmlString);
int Counter = 0;
document.DocumentNode.Descendants("img")
    // Only process images that actually have a src value.
    .Where(e => !string.IsNullOrEmpty(e.GetAttributeValue("src", null)))
    .ToList()
    .ForEach(x =>
    {
        // Original URL (unused in this simplified example).
        string currentSrcValue = x.GetAttributeValue("src", null);
        string localImgPath = "<Somepath>IMG_" + Counter.ToString() + ".jpg";
        Counter++;
        // Point the image at the numbered local path.
        x.SetAttributeValue("src", localImgPath);
    });

INPUT : <img src="https://imagepath"/>

OUTPUT: <img src="<somepath>IMG_1.jpg"/>

This works perfectly. The issue I am facing is that some of the images are wrapped in a hyperlink, such as:

<a href="https://imagepath"><img src="https://imagepath"/></a>

While processing the images, I want to detect when an image is inside a hyperlink and remove that hyperlink, like this:

INPUT : <a href="https://imagepath"><img src="https://imagepath"/></a>

OUTPUT: <img src="<somepath>IMG_1.jpg"/>

Note that I do not want to remove every hyperlink in my HTML, only a hyperlink that is the parent of an image.

Is it possible using HAP?


Solution

  • You should be able to accomplish this with the code below. Grab all the image elements and check each one's parent. If the parent is a link, add it to a list of nodes to remove.

    var images = document.DocumentNode.Descendants("img").ToList();
    
    var nodesToRemove = new List<HtmlNode>();
    
    foreach (var image in images)
    {
        var parent = image.ParentNode;
        if (parent.Name.Equals("a"))
        {
            nodesToRemove.Add(parent);
        }
    }
    

    Then remove each of those nodes by calling the RemoveChild method on its parent. RemoveChild takes the node to remove plus a bool stating whether to keep the grandchildren; pass true here, since you want to keep the image elements.

    foreach (var node in nodesToRemove)
    {
        node.ParentNode.RemoveChild(node, true);
    }
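
The two steps above can be combined into one helper. The following is a minimal sketch, not a definitive implementation: the class name ImageLinkUnwrapper and the method name UnwrapLinkedImages are hypothetical, and it assumes the HtmlAgilityPack NuGet package is referenced. It also adds Distinct(), on the assumption that a single link may wrap more than one image and should only be removed once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ImageLinkUnwrapper
{
    // Hypothetical helper: removes every <a> that is the direct parent of an <img>,
    // keeps the link's children in place, and returns the resulting HTML.
    public static string UnwrapLinkedImages(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Collect first and modify afterwards, so the tree is not changed
        // while it is being enumerated.
        List<HtmlNode> linksToRemove = document.DocumentNode
            .Descendants("img")
            .Select(img => img.ParentNode)
            .Where(parent => parent != null && parent.Name == "a")
            .Distinct() // a link wrapping several images is listed only once
            .ToList();

        foreach (HtmlNode link in linksToRemove)
        {
            // true = keep the grandchildren (the <img>) in place of the link.
            link.ParentNode.RemoveChild(link, true);
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

Calling ImageLinkUnwrapper.UnwrapLinkedImages(htmlString) before or after the src-replacement loop should leave text-only links untouched, since only links that are the direct parent of an img are collected. An image nested deeper inside a link (for example inside a span within the a element) is not handled by this sketch.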