To keep the example simple: I have the following code, which takes an HTML string and uses HAP (HtmlAgilityPack) to find every image src and replace it with a numbered local path.
HtmlDocument document = new HtmlDocument();
document.LoadHtml(htmlString);
int Counter = 0;
document.DocumentNode.Descendants("img")
    // Keep only images that have a non-empty src attribute.
    .Where(e => !string.IsNullOrEmpty(e.GetAttributeValue("src", null)))
    .ToList()
    .ForEach(x =>
    {
        // The original URL, kept in case it is needed.
        string currentSrcValue = x.GetAttributeValue("src", null);
        // Point the image at a numbered local file.
        string localImgPath = "<Somepath>IMG_" + Counter.ToString() + ".jpg";
        Counter++;
        x.SetAttributeValue("src", localImgPath);
    });
INPUT : <img src="https://imagepath"/>
OUTPUT: <img src="<somepath>IMG_1.jpg"/>
This works perfectly. The issue I am facing is that some of the images are inside a hyperlink, such as:
<a href="https://imagepath"><img src="https://imagepath"/></a>
While processing the images, I want to detect whether an image is inside a hyperlink and, if so, remove the hyperlink while keeping the image, as in the following:
INPUT : <a href="https://imagepath"><img src="https://imagepath"/></a>
OUTPUT: <img src="<somepath>IMG_1.jpg"/>
Note that I do not want to remove every hyperlink in my HTML, only hyperlinks that are the parent of an image.
Is it possible using HAP?
You should be able to accomplish this with the code below. Grab all the image elements and check each one's parent. If the parent is a link, add it to a list of nodes to remove.
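var images = document.DocumentNode.Descendants("img").ToList();
var nodesToRemove = new List<HtmlNode>();
foreach (var image in images)
{
    var parent = image.ParentNode;
    // Queue each link only once, even if it wraps several images.
    if (parent != null && parent.Name.Equals("a") && !nodesToRemove.Contains(parent))
    {
        nodesToRemove.Add(parent);
    }
}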
Then remove those nodes by getting each node's parent and calling its RemoveChild method. It takes the node you want removed, plus a bool stating whether to keep the grandchildren. In this case you want to pass true, since you want to keep the image elements.
foreach (var node in nodesToRemove)
{
node.ParentNode.RemoveChild(node, true);
}
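For reference, here is a minimal sketch that puts both steps together: renumbering the src values and unwrapping the links in one pass. The sample HTML, the `Program` class, and the name `linksToUnwrap` are made up for illustration. It assumes the HtmlAgilityPack NuGet package is referenced and, like the code above, only checks the image's direct parent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample input: a linked image, a plain image, and an ordinary text link.
        string htmlString =
            "<p><a href=\"https://imagepath\"><img src=\"https://imagepath\"/></a>" +
            "<img src=\"https://imagepath2\"/>" +
            "<a href=\"https://example.com\">text link</a></p>";

        var document = new HtmlDocument();
        document.LoadHtml(htmlString);

        int counter = 0;
        // A HashSet so that a link wrapping several images is only queued once.
        var linksToUnwrap = new HashSet<HtmlNode>();

        // ToList() takes a snapshot of the images before the tree is modified.
        foreach (var image in document.DocumentNode.Descendants("img").ToList())
        {
            // Skip images without a usable src, as in the question's code.
            if (string.IsNullOrEmpty(image.GetAttributeValue("src", null)))
                continue;

            image.SetAttributeValue("src", "<Somepath>IMG_" + counter + ".jpg");
            counter++;

            // Only the direct parent is checked here.
            var parent = image.ParentNode;
            if (parent != null && parent.Name == "a")
                linksToUnwrap.Add(parent);
        }

        foreach (var link in linksToUnwrap)
        {
            // true keeps the link's children (the images) in place of the link.
            link.ParentNode.RemoveChild(link, true);
        }

        // The text link should still be present, since it is not the parent of an image.
        Console.WriteLine(document.DocumentNode.OuterHtml);
    }
}
```

If your images can be nested deeper inside the link (for example inside a span within the a element), the direct-parent check will miss them. In that case you could look at `image.Ancestors("a")` instead, but I have not tested that variation here.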