I need help parsing a web page with Python and the requests-html library. Here is the <div> that I want to analyze:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>
It renders as:
Text
I need to get Te<b>x</b>t as a result of parsing, without the <div> and <span> but with the <b> tags.
Using element as a requests-html object, here is what I am getting.
element.html:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>
element.text:
ATe\nx\nt
element.full_text:
AText
Could you please tell me how I can get rid of the <span> but still keep the <b> tags in the parsing result?
Don't overcomplicate it.
How about some simple string processing: take element.html and get the string between two boundaries, </span> and </div>. Like this:
myHtml = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
myAnswer = myHtml.split("</span>")[1]
myAnswer = myAnswer.split("</div>")[0]
print(myAnswer)
output:
Te<b>x</b>t
Seems to work for the sample you provided. If you have more complex requirements, let us know and I'm sure someone can adapt this further.
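If the markup varies a little (extra attributes on the <div>, surrounding whitespace, or no marker span at all), the same boundary idea can be sketched with the standard re module. This is only a rough sketch: strip_wrapper is a made-up helper name, and it assumes the marker is always a single <span class="marker"> with no spans nested inside it. A regex is not a general HTML parser, so treat it as a quick fix for input shaped like the sample.

```python
import re


def strip_wrapper(html: str) -> str:
    """Return the inner HTML of the outer <div>, minus the marker <span>.

    Hypothetical helper; assumes a single outer <div> and at most one
    <span class="marker"> without nested spans.
    """
    # Drop the outer <div ...>...</div> wrapper, keeping what is inside it.
    match = re.fullmatch(r"\s*<div\b[^>]*>(.*)</div>\s*", html, flags=re.DOTALL)
    inner = match.group(1) if match else html
    # Remove the marker span and its contents; other tags such as <b> stay.
    return re.sub(
        r'<span class="marker">.*?</span>', "", inner, count=1, flags=re.DOTALL
    )


sample = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
print(strip_wrapper(sample))  # Te<b>x</b>t
```

With a requests-html element you would pass element.html instead of the sample string. Input without the wrapper or without the marker comes back unchanged, so the helper does not raise an IndexError the way the split version would.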