c# html-agility-pack epub

Why doesn't HtmlAgilityPack see the body in some EPUB documents?


I'm parsing EPUB documents (opened with VersOne.Epub) using HtmlAgilityPack. It worked at first, but as I test more books, it fails to find the body element in some of them.

I originally tested English-language books, and the problem first appeared with a Russian-language book, so I suspected the alphabet or the encoding. That doesn't seem to be the cause, though: in one book I managed to get the body by selecting either "/html/body" or just "//body". Neither works in the latest book. For example, this short section:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
<link rel="stylesheet" href="style.css" type="text/css"/>
<link rel="stylesheet" href="style.css" type="text/css"/>
</head>
<body class="z">
<span id="id154"><div class="title3">
<p class="p">295</p>
</div><p class="p1">От слова bando - публичное оповещение - произошло слово bando-lero, означавшее разбойника, голова которого была оценена.</p></span>
</body>
</html>

The EPUB library returns the string fine. The html tag is closed, check. The body tag is closed, check. The encoding is declared. Yet SelectSingleNode returns null. What could be the problem?

Just in case, here's what I'm doing:

HtmlDocument htmlDocument = new();
htmlDocument.LoadHtml(textContentFile.Content);
var bodyNode = htmlDocument.DocumentNode.SelectSingleNode("//body");

textContentFile comes from the Epub reading order.


Solution

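  • EPUB content documents are XHTML, which is well-formed XML, so one option is to parse the string with an XML parser such as XDocument (LINQ to XML) instead. The elements live in the XHTML namespace, so the lookup has to include it. Try the following: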

    using System;
    using System.Linq;
    using System.Xml.Linq;
    
    namespace ConsoleApplication2
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Sample content file from the question. The XML declaration
                // must be the very first thing in the string.
                string xml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
                         <html xmlns=""http://www.w3.org/1999/xhtml"">
                             <head>
                               <title/>
                               <link rel=""stylesheet"" href=""style.css"" type=""text/css""/>
                               <link rel=""stylesheet"" href=""style.css"" type=""text/css""/>
                             </head>
                            <body class=""z"">
                               <span id=""id154"">
                                  <div class=""title3"">
                                     <p class=""p"">295</p>
                                  </div>
                                  <p class=""p1"">От слова bando - публичное оповещение - произошло слово bando-lero, означавшее разбойника, голова которого была оценена.</p></span>
                            </body>
                        </html>
    ";
                // Parse the XHTML as XML rather than as HTML.
                XDocument doc = XDocument.Parse(xml);
    
                // All elements are in the default namespace declared on <html>,
                // so element names must be qualified with it.
                XNamespace ns = doc.Root.GetDefaultNamespace();
    
                // First <body> element in the XHTML namespace, or null if there is none.
                XElement body = doc.Descendants(ns + "body").FirstOrDefault();
    
                Console.WriteLine(body == null
                    ? "body not found"
                    : "found body, class = " + (string)body.Attribute("class"));
            }
        }
    }
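
To plug this into the question's code, the lookup could be wrapped in a small helper that takes the string from textContentFile.Content. This is only a minimal sketch: XhtmlBody and FindBody are made-up names for illustration, not part of VersOne.Epub or HtmlAgilityPack, and it assumes each content file is well-formed XML (XDocument.Parse throws an XmlException otherwise). It matches on the local name, so it should work whether or not a given book declares the XHTML namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of any library.
public static class XhtmlBody
{
    // Parses an XHTML content document as XML and returns its first
    // <body> element, or null when the document has none.
    // Throws System.Xml.XmlException if the input is not well-formed XML.
    public static XElement FindBody(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // Compare local names only, so the match does not depend on
        // which namespace (if any) the document declares on <html>.
        return doc.Descendants()
                  .FirstOrDefault(e => e.Name.LocalName == "body");
    }
}

// Possible usage with the question's variable (assumed to hold the XHTML string):
//   XElement body = XhtmlBody.FindBody(textContentFile.Content);
//   string text = body?.Value;   // concatenated text of all descendants
```

Matching on the local name trades strictness for convenience: it would also pick up a body element from a foreign namespace, which is unlikely in an EPUB content file but worth knowing.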