<div1 class="tag1">
<div2 class="tag2">
<div3 class="tag3">no</div3>
yes
</div2>
</div1>
I want to parse div1 and get its own text, if it has any, so that I keep {name_class: tag1 (or None), text: None}.
Then I want to repeat this for every nested element: {name_class: tag2, text: yes}, {name_class: tag3, text: no}.
My code to resolve this problem:
from pyquery import PyQuery as pq
a = '<div><div>no</div>yes</div>'
tryy = pq(a)[0]
tmp = [{"text" : tryy.text, "class" : pq(tryy).attr('class')}]
tmp + parse_rec(a)
Here type(tryy) is lxml.etree._Element.
The problem is that lxml.etree._Element.text does not include the "yes" contained in div2.
I also tried the bs4 approach from "Only extracting text from this element, not its children", but it does not work for me.
Solutions using any library are welcome.
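Based on the documentation, the text "yes" is considered the tail of the element div3: lxml stores text that follows an element's closing tag in that element's .tail, not in the parent's .text. Using your sample XML (saved as sample.xml), the following code: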
from lxml import etree

root = etree.parse("sample.xml")
for element in root.iter():
    # .text and .tail can each be None, so fall back to an empty string
    text = (element.text or "").strip()
    tail = (element.tail or "").strip()
    print(f"{text}, {element.attrib['class']}, {tail}")
Outputs:
, tag1,
, tag2,
no, tag3, yes
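To get from there to the list of dictionaries asked for in the question, one option is to treat each child's tail as part of its parent's own text. The following is only a minimal sketch under a few assumptions: the helper names own_text and collect are made up for illustration, the sample is parsed from a string rather than from sample.xml, and joining separate text fragments with a single space is an arbitrary choice.

```python
from lxml import etree

SAMPLE = """<div1 class="tag1">
<div2 class="tag2">
<div3 class="tag3">no</div3>
yes
</div2>
</div1>"""


def own_text(element):
    """Return the text directly inside element (excluding descendants), or None."""
    # Text before the first child lives in element.text.
    parts = [element.text or ""]
    # Text after each child lives in that child's .tail but belongs to element.
    parts.extend(child.tail or "" for child in element)
    # Assumption: fragments are stripped and joined with one space.
    joined = " ".join(part.strip() for part in parts if part.strip())
    return joined or None


def collect(root):
    """Build one {name_class, text} record per element, in document order."""
    return [
        {"name_class": el.get("class"), "text": own_text(el)}
        for el in root.iter("*")  # "*" restricts iteration to elements
    ]


root = etree.fromstring(SAMPLE)
records = collect(root)
for record in records:
    print(record)
```

With the sample above, div1 has only whitespace of its own and gets None, div2 picks up "yes" from the tail of div3, and div3 keeps "no".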