c# html xpath html-agility-pack

Break out an html-element from within a table-element


I'm having trouble finding a proper way to pull the <h4> tag out of the following code. I need to keep the <h4> in the document, and I also need to delete the table it currently sits in.

So, how do I delete the whole table and keep the h4-tag where it is?

<table align="center" border="0" cellpadding="0" cellspacing="0">
<tr><td height="30" align="center" colspan="5"><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
  <tr> 
    <td><a href="index.html" target="_top" onclick="MM_nbGroup('down','group1','contents','',1)" onmouseover="MM_nbGroup('over','contents','../figs/contents1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="contents" src="../figs/contents.gif" border="0" alt="" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','authorindex','',1)" onmouseover="MM_nbGroup('over','authorindex','../figs/iauthori1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/iauthori.gif" alt="" name="authorindex" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','subjindex','',1)" onmouseover="MM_nbGroup('over','subjindex','../figs/isubji1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isubji.gif" alt="" name="subjindex" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../search.html" target="_top" onclick="MM_nbGroup('down','group1','search','',1)" onmouseover="MM_nbGroup('over','search','../figs/isearch1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isearch.gif" alt="" name="search" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','home','',1)" onmouseover="MM_nbGroup('over','home','../figs/ihome1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="home" src="../figs/ihome.gif" border="0" alt="" onload=""></a></td>
  </tr>
</table>

Beyond this, I have about 2500 HTML documents that follow a similar structure but were written in different versions of HTML, so they use divs, tables, or other elements from version to version. So I need a method that can be adapted to each case.

I have a document loader ready; it loads all the files into a list, so I will be feeding a method this list of filenames to open and parse. But I can't figure out how to use XPath for this one.


Solution

  • One way to solve the problem is to find all <h4> nodes, walk up each node's parent chain until you find a stop tag/node, and replace that stop tag/node with the <h4>:

    Given some sample HTML that resides in an HTML file:

    var html =
    @"<!doctype html system 'html.dtd'>
    <html><head></head>
    <body>
    <table align='center' border='0' cellpadding='0' cellspacing='0'>
    <tr><td height='30' align='center' colspan='5'><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
      <tr> 
        <td><a href='index.html'><img name='contents' src='../figs/contents.gif' border='0' alt='' onload=''></a></td>
        <td><a href='../page.html'><img src='../figs/iauthori.gif' alt='' name='authorindex' width='120' height='20' border='0' onload=''></a></td>
        <td><a href='../page.html'><img src='../figs/isubji.gif' alt='' name='subjindex' width='120' height='20' border='0' onload=''></a></td>
        <td><a href='../search.html'><img src='../figs/isearch.gif' alt='' name='search' width='120' height='20' border='0' onload=''></a></td>
        <td><a href='../page.html'><img name='home' src='../figs/ihome.gif' border='0' alt='' onload=''></a></td>
      </tr>
    </table>
    
    <div>
    <h4>H4 nested in DIV</h4>
    <p>Paragraph <strong>bold</strong> <a href=''>Hyperlink</a></p>
    </div>
    
    <p><h4>H4 nested in P</h4></p>
    
    </body>
    </html>";
    

    Parse it with this method:

    using System.Linq;        // for stopTags.Contains
    using HtmlAgilityPack;

    public string ParseHtmlToString(string inputFilePath)
    {
        var document = new HtmlDocument();
        document.Load(inputFilePath);
        // SelectNodes returns null (not an empty collection) when nothing matches
        var wantedNodes = document.DocumentNode.SelectNodes("//h4");
        if (wantedNodes == null)
        {
            return document.DocumentNode.WriteTo();
        }
        // stop at these tags while walking up the parent chain
        var stopTags = new string[] { "table", "div", "p" };
        HtmlNode parentNode;
    
        foreach (var node in wantedNodes)
        {
            HtmlNode testNode = node;
            while ((parentNode = testNode.ParentNode) != null)
            {
                if (stopTags.Contains(parentNode.Name))
                {
                    // swap the enclosing stop tag for the <h4>, then stop climbing
                    parentNode.ParentNode.ReplaceChild(node, parentNode);
                    break;
                }
                testNode = parentNode;
            }
        }
    
        return document.DocumentNode.WriteTo();
    }
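
    Since the question asks about XPath specifically, the walk up the parent chain can probably also be written as an ancestor query. The following is only a sketch: the class and method names (HeadingLifter, LiftHeadings) are made up for illustration, it takes an HTML string instead of a file path, and it assumes the Html Agility Pack XPath engine handles the ancestor axis the way XPath 1.0 describes.

    ```csharp
    using System.Linq;
    using HtmlAgilityPack;

    public static class HeadingLifter
    {
        // Hypothetical helper, not part of the original answer.
        public static string LiftHeadings(string html)
        {
            var document = new HtmlDocument();
            document.LoadHtml(html);

            // SelectNodes returns null when nothing matches.
            var headings = document.DocumentNode.SelectNodes("//h4");
            if (headings == null)
            {
                return document.DocumentNode.WriteTo();
            }

            // Copy to a list so the tree can be modified while iterating.
            foreach (var heading in headings.ToList())
            {
                // ancestor:: is a reverse axis, so [1] picks the nearest
                // enclosing table, div or p.
                var container = heading.SelectSingleNode(
                    "ancestor::*[self::table or self::div or self::p][1]");
                if (container?.ParentNode != null)
                {
                    container.ParentNode.ReplaceChild(heading, container);
                }
            }

            return document.DocumentNode.WriteTo();
        }
    }
    ```

    The list of stop tags now lives inside the XPath expression, so supporting another HTML version means adding another self:: test to the predicate.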
    

    Then you can assign the parsed HTML to a variable like this:

    var parsedHtml = ParseHtmlToString(INPUT_FILE);
    

    which returns the following value:

    <!doctype html system 'html.dtd'>
    <html><head></head>
    <body>
    <h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4>
    
    <h4>H4 nested in DIV</h4>
    
    <h4>H4 nested in P</h4>
    
    </body>
    </html>
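
    For the roughly 2500 files mentioned in the question, the method could be driven by a small loop over the list of file names. This is a minimal sketch under some assumptions: BatchRunner, ProcessFiles and the output directory are hypothetical names, the parse step is passed in as a delegate so it works with ParseHtmlToString or any other method of the same shape, and results are written to a separate directory so the original files stay untouched.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;

    public static class BatchRunner
    {
        // Hypothetical helper: runs `parse` on every input file and writes
        // the returned HTML to outputDirectory under the same file name.
        public static void ProcessFiles(
            IEnumerable<string> inputFiles,
            string outputDirectory,
            Func<string, string> parse)
        {
            // No-op if the directory already exists.
            Directory.CreateDirectory(outputDirectory);

            foreach (var inputFile in inputFiles)
            {
                var parsedHtml = parse(inputFile);
                var outputFile = Path.Combine(
                    outputDirectory, Path.GetFileName(inputFile));
                File.WriteAllText(outputFile, parsedHtml);
            }
        }
    }
    ```

    A call might look like BatchRunner.ProcessFiles(fileNames, "cleaned", ParseHtmlToString). Note that input files sharing a file name in different folders would overwrite each other in the output directory, so the naming would need adjusting in that case.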