I'm having trouble finding a proper way to break the <h4> tag out of the following code. I need to keep the <h4> in the document, but delete the table it currently sits in.
So, how do I delete the whole table and keep the h4-tag where it is?
<table align="center" border="0" cellpadding="0" cellspacing="0">
<tr><td height="30" align="center" colspan="5"><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><a href="index.html" target="_top" onclick="MM_nbGroup('down','group1','contents','',1)" onmouseover="MM_nbGroup('over','contents','../figs/contents1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="contents" src="../figs/contents.gif" border="0" alt="" onload=""></a></td>
<td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','authorindex','',1)" onmouseover="MM_nbGroup('over','authorindex','../figs/iauthori1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/iauthori.gif" alt="" name="authorindex" width="120" height="20" border="0" onload=""></a></td>
<td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','subjindex','',1)" onmouseover="MM_nbGroup('over','subjindex','../figs/isubji1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isubji.gif" alt="" name="subjindex" width="120" height="20" border="0" onload=""></a></td>
<td><a href="../search.html" target="_top" onclick="MM_nbGroup('down','group1','search','',1)" onmouseover="MM_nbGroup('over','search','../figs/isearch1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isearch.gif" alt="" name="search" width="120" height="20" border="0" onload=""></a></td>
<td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','home','',1)" onmouseover="MM_nbGroup('over','home','../figs/ihome1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="home" src="../figs/ihome.gif" border="0" alt="" onload=""></a></td>
</tr>
</table>
In total I have about 2500 HTML documents that follow a similar structure but were written in different versions of HTML, so the wrapper is a div, a table, or some other element depending on the version. I therefore need a method that can be adapted to each case.
I already have the document loading in place: it collects all the files in a list, so I will be feeding a method this list of filenames to open and parse. But I can't figure out how to use XPath for this one.
One way to solve the problem is to find all <h4> nodes, walk up each node's parent chain until you find a stop tag/node, and replace the stop tag/node with your <h4>:
Given some sample HTML that resides in an HTML file:
var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<table align='center' border='0' cellpadding='0' cellspacing='0'>
<tr><td height='30' align='center' colspan='5'><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
<tr>
<td><a href='index.html'><img name='contents' src='../figs/contents.gif' border='0' alt='' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/iauthori.gif' alt='' name='authorindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img src='../figs/isubji.gif' alt='' name='subjindex' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../search.html'><img src='../figs/isearch.gif' alt='' name='search' width='120' height='20' border='0' onload=''></a></td>
<td><a href='../page.html'><img name='home' src='../figs/ihome.gif' border='0' alt='' onload=''></a></td>
</tr>
</table>
<div>
<h4>H4 nested in DIV</h4>
<p>Paragraph <strong>bold</strong> <a href=''>Hyperlink</a></p>
</div>
<p><h4>H4 nested in P</h4></p>
</body>
</html>";
Parse it with this method:
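using System.Linq;
using HtmlAgilityPack;

public string ParseHtmlToString(string inputFilePath)
{
    var document = new HtmlDocument();
    document.Load(inputFilePath);
    // SelectNodes returns null, not an empty collection, when nothing matches
    var wantedNodes = document.DocumentNode.SelectNodes("//h4");
    if (wantedNodes == null)
    {
        return document.DocumentNode.WriteTo();
    }
    // stop at these tags while walking backwards up the chain
    var stopTags = new string[] { "table", "div", "p" };
    foreach (var node in wantedNodes)
    {
        HtmlNode testNode = node;
        HtmlNode parentNode;
        while ((parentNode = testNode.ParentNode) != null)
        {
            // skip stop tags that an earlier replacement already detached
            if (stopTags.Contains(parentNode.Name) && parentNode.ParentNode != null)
            {
                // replace the whole stop element with the <h4>, then stop walking
                parentNode.ParentNode.ReplaceChild(node, parentNode);
                break;
            }
            testNode = parentNode;
        }
    }
    return document.DocumentNode.WriteTo();
}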
Then you can assign the parsed HTML to a variable like this:
var parsedHtml = ParseHtmlToString(INPUT_FILE);
which returns the following value:
<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4>
<h4>H4 nested in DIV</h4>
<h4>H4 nested in P</h4>
</body>
</html>
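Since the question asks about XPath and about feeding in a list of filenames, here is a rough, untested sketch of both. It assumes Html Agility Pack evaluates the ancestor axis relative to the node you call SelectSingleNode on. The names HeadlineUnwrapper, Unwrap and UnwrapAll are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper class; the class and method names are illustrative only.
public static class HeadlineUnwrapper
{
    // Selects the nearest enclosing <table>, <div> or <p> of the context node.
    // The ancestor axis is a reverse axis, so [1] means the closest match.
    private const string WrapperXPath =
        "ancestor::*[self::table or self::div or self::p][1]";

    public static string Unwrap(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when there are no <h4> elements.
        var headings = document.DocumentNode.SelectNodes("//h4");
        if (headings == null)
        {
            return html;
        }

        foreach (var heading in headings)
        {
            var wrapper = heading.SelectSingleNode(WrapperXPath);
            // A wrapper without a parent was already replaced by an earlier <h4>.
            if (wrapper != null && wrapper.ParentNode != null)
            {
                wrapper.ParentNode.ReplaceChild(heading, wrapper);
            }
        }

        return document.DocumentNode.WriteTo();
    }

    // Overwrites each file in place with the rewritten markup.
    public static void UnwrapAll(IEnumerable<string> filePaths)
    {
        foreach (var path in filePaths)
        {
            File.WriteAllText(path, Unwrap(File.ReadAllText(path)));
        }
    }
}
```

Calling HeadlineUnwrapper.UnwrapAll(fileNames) with your list of filenames would rewrite every file in place, so run it on a copy of the documents first. File.ReadAllText assumes UTF-8 unless it finds a byte order mark, so older pages in other encodings may need an explicit Encoding argument.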