python parsing python-requests python-requests-html

Exclude span from parsing with requests-html

I need help parsing a web page with Python and the requests-html library. Here is the <div> that I want to analyze:

<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>

It renders as:

Text

I need to get Te<b>x</b>t as a result of parsing, without <div> and <span> but with <b> tags.

With element being the requests-html object for this <div>, here is what I get.

element.html:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>

element.text:
ATe\nx\nt

element.full_text:
AText

Could you please tell me how I can get rid of the <span> but still keep the <b> tags in the parsing result?

Solution

Don't overcomplicate it.

How about some simple string processing that takes the substring between two boundaries:

Start from element.html
Take everything after the closing </span>
Take everything before the closing </div>

Like this:

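# Stands in for element.html
myHtml = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'

# Keep everything after the closing </span> ...
myAnswer = myHtml.split("</span>")[1]
# ... and everything before the closing </div>
myAnswer = myAnswer.split("</div>")[0]

print(myAnswer)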

Output:

Te<b>x</b>t

This seems to work for the sample you provided. If you have more complex requirements, let us know and I'm sure someone can adapt this further.
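
If the markup varies a little, the same idea can be wrapped in a small helper that does not raise an IndexError when the <span> is missing. This is only a sketch: strip_leading_span is a made-up name, and it assumes the marker is the first <span> inside the <div> and that the fragment ends with the closing </div>. It uses plain str.partition and str.rpartition, nothing from requests-html.

```python
def strip_leading_span(html: str) -> str:
    """Return the inner HTML of the outer <div>, minus the first <span>...</span>.

    Hypothetical helper, not part of requests-html.
    """
    # Drop everything up to and including the first closing </span>.
    _, found, rest = html.partition("</span>")
    if not found:
        # No span present: fall back to whatever follows the opening <div ...> tag.
        rest = html.partition(">")[2]
    # Drop the last closing </div> and anything after it.
    inner, found, _ = rest.rpartition("</div>")
    return inner if found else rest


# Stands in for element.html
sample = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
print(strip_leading_span(sample))
```

You would call it as strip_leading_span(element.html). It is still string slicing rather than real HTML parsing, so nested <div> or several <span> elements in the same answer would need a proper parser.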