Parse an HTML element using pyquery, BeautifulSoup, or another library

<div1 class="tag1">
  <div2 class="tag2">
    <div3 class="tag3">no</div3>
    yes
  </div2>
</div1>

I want to parse div1 and collect its own text, if it has any, as {name_class: tag1 (or None), text: None}, then do the same recursively for its children: {name_class: tag2, text: yes}, {name_class: tag3, text: no}

My code to resolve this problem:

from pyquery import PyQuery as pq

a = '<div><div>no</div>yes</div>'
tryy = pq(a)[0]

tmp = [{"text" : tryy.text, "class" : pq(tryy).attr('class')}]
tmp + parse_rec(a)

type(tryy) is lxml.etree._Element. The problem is that lxml.etree._Element.text does not include the "yes" contained in div2.

I also tried the bs4 approach from "Only extracting text from this element, not its children", but it does not work.

Any solution is welcome, whatever the library.

Solution

Based on the documentation, the text "yes" is considered the tail of the element div3, not the text of div2. Using your sample XML, the following code:

from lxml import etree

root = etree.parse("sample.xml")

# iter() replaces the deprecated getiterator(); text and tail may be None
for element in root.iter():
    tail = element.tail.strip() if element.tail else ''
    print(f"{(element.text or '').strip()}, {element.get('class')}, {tail}")

Outputs:

, tag1, 
, tag2, 
no, tag3, yes
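
To get the list of dicts asked for in the question, one option is to treat each child's tail as part of its parent's own text. The following is only a minimal sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree, which shares the text/tail model with lxml (swapping the import for `from lxml import etree as ET` should behave the same on this sample), and the helper names own_text and collect are made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<div1 class="tag1">
  <div2 class="tag2">
    <div3 class="tag3">no</div3>
    yes
  </div2>
</div1>
"""

def own_text(element):
    """Return the text that belongs directly to element, or None.

    That is element.text plus the tail of each direct child, because
    text following a child's closing tag is stored on that child.
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    cleaned = [p.strip() for p in parts if p.strip()]
    return " ".join(cleaned) or None

def collect(root):
    """Walk the tree in document order, one dict per element."""
    return [
        {"name_class": el.get("class"), "text": own_text(el)}
        for el in root.iter()
    ]

root = ET.fromstring(SAMPLE.strip())
result = collect(root)
for row in result:
    print(row)
```

An element without a class attribute gets name_class None, since get() returns None for a missing attribute instead of raising KeyError.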