python, lxml, mathjax

Parse '<' Symbol with lxml


I'm currently facing a problem with MathJax equations containing '<' symbols. If I parse these with lxml, the string gets cropped.

Is there a way to tell the parser not to remove unknown tags (I guess that's the problem) but to keep them as they are?

E.g.:

from lxml import html

s = "<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
tree = html.fragment_fromstring(s)
html.tostring(tree)

gives:

'<div> This is a text with mathjax like $11$! </div>'

It would be fine if the '<' got escaped, as long as nothing is cropped.

I am fully aware that this is not valid XML. Unfortunately, I cannot replace the '<' symbols with the corresponding HTML entity in the source, because I'm actually parsing a Markdown file that contains HTML tags, and '<' is a perfectly valid character there.

Thanks!

Jakob


Solution

  • lxml alone does not work here, but its BeautifulSoup-based parser (lxml.html.soupparser, which needs BeautifulSoup installed) works fine!

    import lxml.html.soupparser as sp
    from lxml import html

    s1 = "This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!"
    soup1 = sp.fromstring(s1)
    # tostring() escapes '<' and '>' as entities; unescape() converts named entities back
    print(sp.unescape(html.tostring(soup1, encoding='unicode')))
    

    gives:

    <html>This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!</html>
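
If installing BeautifulSoup is not an option, another route is to escape the '<' in memory before the string reaches lxml, leaving the Markdown source untouched. This is only a rough sketch and not part of the answer above: `escape_math` and `MATH_SPAN` are made-up names, and it assumes that math is always written inline as `$...$` on a single line, with no literal `$` inside and no entities that are already escaped.

```python
import re
from html import escape

# Matches inline math delimited by single dollar signs, e.g. $1<2$.
# Assumption: a math span stays on one line and contains no literal "$".
MATH_SPAN = re.compile(r"\$[^$\n]+\$")


def escape_math(text):
    """Escape '&', '<' and '>' inside $...$ spans only; real tags are left alone."""
    return MATH_SPAN.sub(lambda m: escape(m.group(0), quote=False), text)


s = "<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
print(escape_math(s))
# <div> This is a text with mathjax like $1&lt;2$, let's see if this works till here $2&gt;1$! </div>
```

The escaped string can then be passed to `html.fragment_fromstring` as in the question; since the parser no longer sees a bare '<' inside the math, it should have no reason to treat it as the start of a tag. Display math (`$$...$$`) or dollar signs used as currency would need extra handling.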