Tags: python, parsing, python-requests, python-requests-html

Exclude span from parsing with requests-html


I need help parsing a web page with Python and the requests-html library. Here is the <div> that I want to analyze:

<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>

It renders as:

Text

I need to get Te<b>x</b>t as the parsing result: without the <div> and <span> tags, but with the <b> tags.

With element as a requests-html object, here is what I get:

element.html:
<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>

element.text:
ATe\nx\nt

element.full_text:
AText

Could you please tell me how I can get rid of the <span> but still keep the <b> tags in the parsing result?


Solution

  • Don't overcomplicate it.

    How about some simple string processing to get the string between two boundaries:

    • Use element.html
    • Take everything after the closing </span>
    • Take everything before the closing </div>

    Like this:

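    # In practice this string would come from element.html
    my_html = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'

    # Keep everything after the closing </span> ...
    my_answer = my_html.split("</span>")[1]
    # ... and everything before the closing </div>
    my_answer = my_answer.split("</div>")[0]

    print(my_answer)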
    

    Output:

    Te<b>x</b>t
    

    This seems to work for the sample you provided. If you have more complex requirements, let us know and I'm sure someone can adapt this further.
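
The same boundary idea can be wrapped in a small helper that copes with a few more cases. This is only a sketch: the function name inner_after_marker is made up for illustration, and it assumes the fragment starts with the opening <div ...> tag and that the marker <span> comes first inside it. It uses str.partition and str.rpartition instead of split, so a later </span> inside the answer text is not cut off, and a fragment without a marker span does not raise an IndexError.

```python
def inner_after_marker(html: str) -> str:
    """Return the markup between the first </span> and the last </div>.

    Hypothetical helper; assumes `html` is a single <div>...</div> fragment.
    """
    # partition() splits on the first occurrence only, so any later
    # </span> tags inside the answer text are preserved.
    _, found_span, rest = html.partition("</span>")
    if not found_span:
        # No marker span: fall back to everything after the opening tag.
        _, _, rest = html.partition(">")
    # rpartition() splits on the last occurrence, i.e. the outer </div>.
    body, found_div, _ = rest.rpartition("</div>")
    return body if found_div else rest


sample = '<div class="answer"><span class="marker">А</span>Te<b>x</b>t</div>'
print(inner_after_marker(sample))  # Te<b>x</b>t
```

With requests-html you would pass element.html to the helper instead of the hard-coded sample string.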