I'm currently facing a problem with MathJax equations containing '<' symbols. If I parse these with lxml, the string gets cropped.
Is there a way to tell the parser not to remove unknown tags (I guess that's the problem) but to keep them as they are?
For example:
from lxml import html
s = "<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
tree = html.fragment_fromstring(s)
html.tostring(tree)
gives:
'<div> This is a text with mathjax like $11$! </div>'
It would be fine if the '<' got escaped, as long as nothing is cropped.
I am fully aware that this is not valid XML. Unfortunately, I cannot replace the '<' symbols with the correct HTML entity in the source, because I'm actually parsing a Markdown file that contains HTML tags, and '<' is a perfectly valid character there.
Thanks!
Jakob
lxml's own HTML parser does not handle this, but its BeautifulSoup-based parser (lxml.html.soupparser, which requires BeautifulSoup to be installed) works fine!
s1="This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!"
import lxml.html.soupparser as sp
from lxml import html
soup1 = sp.fromstring(s1)
print(sp.unescape(html.tostring(soup1, encoding='unicode')))
gives
<html>This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!</html>
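If adding BeautifulSoup is not an option, another approach might be to escape the '<' inside the math spans before lxml ever sees the string, which is what the question says would be acceptable. The following is only a minimal sketch for Python 3, assuming inline math is always delimited by single $ signs and never spans a line break; escape_math and MATH are hypothetical names made up for this example, not part of lxml.

```python
import re
from html import escape  # standard library, not lxml.html

# Assumption: inline math looks like $...$ with no nested $ and no newline.
MATH = re.compile(r"\$[^$\n]+\$")

def escape_math(text):
    """Escape &, < and > inside $...$ spans; leave everything else untouched."""
    return MATH.sub(lambda m: escape(m.group(0), quote=False), text)

s = "<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
escaped = escape_math(s)
print(escaped)
```

The escaped string can then be passed to html.fragment_fromstring as in the question; the parser should now see `&lt;` as ordinary text instead of the start of a tag. Text outside the $...$ spans, including the real HTML tags, is left unchanged, so a literal $ used for something other than math would confuse this sketch.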