I'm having a bit of trouble figuring out why HXT is replacing my DTDs. First, here is the input file to be parsed:
<!DOCTYPE html>
<html>
<head>
<title>foo</title>
</head>
<body>
<h1>foo</h1>
</body>
</html>
and this is the output that I get:
<?xml version="1.0" encoding="US-ASCII"?>
<html>
<head>
<title>foo</title>
</head>
<body>
<h1>foo</h1>
</body>
</html>
Finally, here is a simplified version of the arrows I'm using:
start (App src dest) = runX $
    readDocument [ withValidate no
                 , withSubstDTDEntities no
                 , withParseHTML yes
                 --, withTagSoup
                 ]
                 src
    >>>
    this
    >>>
    writeDocument [ withIndent yes
                  , withSubstDTDEntities no
                  , withOutputHTML
                  --, withOutputEncoding "UTF-8"
                  ]
                  dest
I apologize for the comments - I've been toying with different combinations of configs. I just can't seem to get HXT to not mess with DTDs, even with withSubstDTDEntities no, withValidate no, etc. I am getting a warning saying that HXT is ignoring my doctype declaration, but that's the only bit of insight I have. Can anyone please lend me a hand? Thank you in advance!
You have two problems.
First, HXT only accepts one of the following three XHTML doctypes:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"DTD/xhtml1-transitional.dtd">
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"DTD/xhtml1-frameset.dtd">
Using one of these will get rid of the warning about ignoring the DTD.
Second, add the following option to writeDocument:
withAddDefaultDTD yes
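Putting both changes together, the pipeline from the question might look roughly like the sketch below. It assumes the input file already starts with one of the three XHTML doctypes listed above; the function name copyWithDoctype and the file names input.html and output.html are placeholders for illustration, not something from the question. I haven't checked the exact output, so treat it as a starting point rather than a confirmed fix.

```haskell
import Text.XML.HXT.Core

-- Read src, pass the tree through unchanged, and write it to dest.
-- copyWithDoctype is a hypothetical name used only for this sketch.
copyWithDoctype :: String -> String -> IO ()
copyWithDoctype src dest = do
  _ <- runX $
         readDocument [ withValidate no
                      , withSubstDTDEntities no
                      , withParseHTML yes
                      ]
                      src
         >>>
         writeDocument [ withIndent yes
                       , withOutputHTML
                       , withAddDefaultDTD yes  -- the option suggested above
                       ]
                       dest
  return ()

main :: IO ()
main = copyWithDoctype "input.html" "output.html"  -- placeholder file names
```

The only differences from the arrows in the question are the extra withAddDefaultDTD yes entry in the writeDocument options and the dropped no-op this step.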