Break out an html-element from within a table-element

I'm having problems finding a proper way of breaking out the H4-tag from the following code. Not only do I need to make it stay in the code, but I also need to delete the table it currently sits in.

So, how do I delete the whole table and keep the h4-tag where it is?

<table align="center" border="0" cellpadding="0" cellspacing="0">
<tr><td height="30" align="center" colspan="5"><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
  <tr> 
    <td><a href="index.html" target="_top" onclick="MM_nbGroup('down','group1','contents','',1)" onmouseover="MM_nbGroup('over','contents','../figs/contents1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="contents" src="../figs/contents.gif" border="0" alt="" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','authorindex','',1)" onmouseover="MM_nbGroup('over','authorindex','../figs/iauthori1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/iauthori.gif" alt="" name="authorindex" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','subjindex','',1)" onmouseover="MM_nbGroup('over','subjindex','../figs/isubji1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isubji.gif" alt="" name="subjindex" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../search.html" target="_top" onclick="MM_nbGroup('down','group1','search','',1)" onmouseover="MM_nbGroup('over','search','../figs/isearch1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isearch.gif" alt="" name="search" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','home','',1)" onmouseover="MM_nbGroup('over','home','../figs/ihome1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="home" src="../figs/ihome.gif" border="0" alt="" onload=""></a></td>
  </tr>
</table>

Further on I have about 2500 html-documents following similar structure, but are in different versions of HTML, thus uses div's, tables or other elements from version to version. So I need a way to alter this method properly.

I have a document load ready, it loads all files in a list, so I will be feeding a method this list of filenames to open and parse. But I can't figure out how to use XPath for this one.

Solution

One way to solve the problem is to find all <h4> nodes, walk up it's parent chain until you find a stop tag/node, and replace the stop tag/node with your <h4>:

Given some sample HTML that resides in a HTML file:

var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<table align='center' border='0' cellpadding='0' cellspacing='0'>
<tr><td height='30' align='center' colspan='5'><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
  <tr> 
    <td><a href='index.html'><img name='contents' src='../figs/contents.gif' border='0' alt='' onload=''></a></td>
    <td><a href='../page.html'><img src='../figs/iauthori.gif' alt='' name='authorindex' width='120' height='20' border='0' onload=''></a></td>
    <td><a href='../page.html'><img src='../figs/isubji.gif' alt='' name='subjindex' width='120' height='20' border='0' onload=''></a></td>
    <td><a href='../search.html'><img src='../figs/isearch.gif' alt='' name='search' width='120' height='20' border='0' onload=''></a></td>
    <td><a href='../page.html'><img name='home' src='../figs/ihome.gif' border='0' alt='' onload=''></a></td>
  </tr>
</table>

<div>
<h4>H4 nested in DIV</h4>
<p>Paragraph <strong>bold</strong> <a href=''>Hyperlink</a></p>
</div>

<p><h4>H4 nested in P</h4></p>

</body>
</html>";

Parse it with this method:

public string ParseHtmlToString(string inputFilePath)
{
    var document = new HtmlDocument();
    document.Load(inputFilePath);
    var wantedNodes = document.DocumentNode.SelectNodes("//h4");
    // stop at these tags while walking backwards up the chain
    var stopTags = new string[] { "table", "div", "p" };
    HtmlNode parentNode;

    foreach (var node in wantedNodes)
    {
        HtmlNode testNode = node;
        while ((parentNode = testNode.ParentNode) != null)
        {
            if (stopTags.Contains(parentNode.Name))
            {
                parentNode.ParentNode.ReplaceChild(node, parentNode);
            }
            testNode = parentNode;
        }
    }

    return document.DocumentNode.WriteTo();
}

Then you can assign the parsed HTML to a variable like this:

var parsedHtml = ParseHtmlToString(INPUT_FILE);

which returns the following value:

<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4>

<h4>H4 nested in DIV</h4>

<h4>H4 nested in P</h4>

</body>
</html>