haskell doctype hxt

HXT ignoring HTML DTD, replacing it with XML DTD


I'm having trouble figuring out why HXT is replacing my DTDs. First, here is the input file to be parsed:

<!DOCTYPE html>
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <h1>foo</h1>
  </body>
</html>

and this is the output that I get:

<?xml version="1.0" encoding="US-ASCII"?>
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <h1>foo</h1>
  </body>
</html>

Finally, here is a simplified version of the arrows I'm using:

start (App src dest) = runX $
                         readDocument [ withValidate no
                                      , withSubstDTDEntities no
                                      , withParseHTML yes
                                      --, withTagSoup
                                      ]
                                      src
                         >>>
                         this
                         >>>
                         writeDocument [ withIndent yes
                                       , withSubstDTDEntities no
                                       , withOutputHTML
                                       --, withOutputEncoding "UTF-8"
                                       ]
                                       dest

I apologize for the commented-out lines; I've been trying different combinations of options. I can't get HXT to leave the DTD alone, even with withSubstDTDEntities no, withValidate no, and so on. HXT does emit a warning that it is ignoring my doctype declaration, but that's the only insight I have. Can anyone please lend me a hand? Thank you in advance!


Solution

  • You have two problems.

    First, HXT accepts only the following three HTML doctypes (all XHTML 1.0):

    <!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     "DTD/xhtml1-strict.dtd">
    
    <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "DTD/xhtml1-transitional.dtd">
    
    <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
     "DTD/xhtml1-frameset.dtd">
    

    Using one of these will get rid of the warning about the ignored DTD.

    Second, add the following option to writeDocument:

    withAddDefaultDTD yes
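
Putting both changes together, the pipeline from the question might look like the sketch below. It assumes the input file already starts with one of the three XHTML doctypes listed above; the function name copyHtml and the file names are hypothetical, and the question's `this` placeholder is left out because it passes the document through unchanged.

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: read an HTML file and write it back out,
-- asking HXT to add a DOCTYPE declaration to the output.
copyHtml :: FilePath -> FilePath -> IO [XmlTree]
copyHtml src dest = runX $
    readDocument [ withValidate no
                 , withSubstDTDEntities no
                 , withParseHTML yes
                 ]
                 src
    >>>
    writeDocument [ withIndent yes
                  , withSubstDTDEntities no
                  , withOutputHTML
                  , withAddDefaultDTD yes   -- the option suggested in this answer
                  ]
                  dest

main :: IO ()
main = do
  -- "input.html" and "output.html" are placeholder file names.
  _ <- copyHtml "input.html" "output.html"
  return ()
```

Any processing arrows would go between readDocument and writeDocument, chained with >>> as in the question.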