Is it possible to parse a <textarea> containing an <a> with hxt?


I use hxt to parse some HTML that has unescaped markup inside a <textarea>. hxt gives invalid results: it stumbles on a tag with content inside the textarea, in this case <a>. A minimal test case (for GHCi) is:

let doc = parseHtml "<textarea>before<a>link</a>after</textarea>"
runX . xshow $ doc //> hasName "textarea"

which gives [<textarea>before</textarea><textarea/>] as a result.

It looks like tags with no contents (e.g. <tag/>) do not break parsing.

Is there any way to parse such html with hxt?


Solution

  • The problem is that HandsomeSoup (which I'm assuming is where your parseHtml comes from) is picky about HTML validity: a textarea can't contain an a in valid HTML, so the parser tries to "fix" any such error it sees.

    Can you switch to hxt-tagsoup? It still accepts messy HTML (unclosed elements, etc.) but isn't so fussy about adherence to the HTML schema. In particular, it will let you have an a in a textarea:

    import Text.XML.HXT.Core
    import Text.XML.HXT.TagSoup
    
    let content = "<textarea>before<a>link</a>after</textarea>"
    let doc = readString [ withTagSoup ] content
    runX . xshow $ doc //> hasName "textarea"
    

    This prints the following:

    ["<textarea>before<a>link</a>after</textarea>"]
    

    Which I think is what you want.
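
If you want this outside GHCi, the same calls can be wrapped in a small standalone program. This is only a sketch: it assumes the hxt and hxt-tagsoup packages are installed, and the second query (pulling the text of the nested a) is my own illustration rather than something from the question.

```haskell
module Main where

import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup (withTagSoup)

main :: IO ()
main = do
  let content = "<textarea>before<a>link</a>after</textarea>"
      -- Parse with the lenient TagSoup parser instead of the built-in HTML parser.
      doc = readString [withTagSoup] content

  -- Serialize every textarea element back to a string, as in the GHCi session.
  textareas <- runX . xshow $ doc //> hasName "textarea"
  mapM_ putStrLn textareas

  -- Illustrative extra query: the text directly inside any a nested in a textarea.
  linkTexts <- runX $ doc //> hasName "textarea" //> hasName "a" /> getText
  mapM_ putStrLn linkTexts
```

The second query is a quick way to check that the a survived as a real element rather than being dropped or moved out of the textarea.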