I use hxt to parse some HTML. It has unescaped HTML inside a <textarea>, and hxt gives invalid results: it stumbles on any tag that has content (in this case <a>). A minimal test case (for GHCi) is:
let doc = parseHtml "<textarea>before<a>link</a>after</textarea>"
runX . xshow $ doc //> hasName "textarea"
which gives [<textarea>before</textarea><textarea/>] as a result.
It looks like tags with no content (e.g. <tag/>) do not break parsing.
Is there any way to parse such HTML with hxt?
The problem is that HandsomeSoup (which I'm assuming is where your parseHtml is from) is picky about HTML validity: a textarea can't contain an a in valid HTML, and the parser will try to "fix" any such errors it sees.
Can you switch to hxt-tagsoup? It will still accept messy HTML (unclosed elements, etc.), but it isn't so fussy about adherence to the HTML schema. Specifically, it will let you have an a in a textarea:
import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup
let content = "<textarea>before<a>link</a>after</textarea>"
let doc = readString [ withTagSoup ] content
runX . xshow $ doc //> hasName "textarea"
This prints the following:
["<textarea>before<a>link</a>after</textarea>"]
Which I think is what you want.
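If you want this outside GHCi, the same pipeline can be wrapped in a small compiled program. This is only a sketch, assuming the hxt and hxt-tagsoup packages are installed; the helper name textareas is made up for illustration and is not part of either library.

```haskell
module Main where

import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup (withTagSoup)

-- Hypothetical helper (not a library function): parse the input with the
-- lenient TagSoup parser and serialise every <textarea> element found
-- anywhere below the document root.
textareas :: String -> IO [String]
textareas content =
  runX . xshow $ readString [withTagSoup] content //> hasName "textarea"

main :: IO ()
main = do
  results <- textareas "<textarea>before<a>link</a>after</textarea>"
  -- Print each serialised <textarea> on its own line.
  mapM_ putStrLn results
```

Keeping the parser choice in one place (the option list passed to readString) makes it easy to compare against the default HTML parser later.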