python, beautifulsoup, lxml, pyquery

Parse an HTML element with pyquery, BeautifulSoup, or another library


<div1 class="tag1">
  <div2 class="tag2">
    <div3 class="tag3">no</div3>
    yes
  </div2>
</div1>

I want to parse div1 and get its own text, if it has any, recording {name_class: tag1 (or None), text: None}. Then I want to repeat this for each nested element: {name_class: tag2, text: yes}, {name_class: tag3, text: no}

My attempt so far:

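from pyquery import PyQuery as pq

a = '<div><div>no</div>yes</div>'
tryy = pq(a)[0]  # the underlying lxml element for the outer div

# .text only holds the text that comes before the first child element
tmp = [{"text": tryy.text, "class": pq(tryy).attr('class')}]
tmp + parse_rec(a)  # parse_rec: the intended recursive helper, not shown here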

type(tryy) is lxml.etree._Element. The problem is that lxml.etree._Element.text does not include the "yes" contained in div2.

I also tried the bs4 approach from "Only extracting text from this element, not its children", but it does not work here.

Solutions using any library are welcome.


Solution

  • Based on the documentation, the text "yes" is considered the tail of the element div3, not the text of div2. Using your sample XML, the following code:

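    from lxml import etree

    # sample.xml contains the div1/div2/div3 markup from the question
    root = etree.parse("sample.xml")

    # iter() walks every element in document order (getiterator() is deprecated)
    for element in root.iter():
        text = (element.text or "").strip()  # text before the first child, may be None
        tail = (element.tail or "").strip()  # text after the element's closing tag
        print(f"{text}, {element.get('class')}, {tail}")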
    

    Outputs:

    , tag1, 
    , tag2, 
    no, tag3, yes
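
Building on the tail behaviour described above, here is a minimal sketch of how the list of dicts from the question could be assembled. It assumes that an element's "own" text is its .text plus the .tail of each direct child, since a child's tail still sits inside the parent. The helper name own_text is hypothetical, parse_rec reuses the name from the question, and the standard library's xml.etree.ElementTree is used here because it shares the text/tail model with lxml.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div1 class="tag1">
  <div2 class="tag2">
    <div3 class="tag3">no</div3>
    yes
  </div2>
</div1>"""


def own_text(element):
    """Return the text that belongs directly to `element`, or None."""
    # element.text is the text before the first child; each child's tail
    # is text that follows that child but is still inside `element`.
    pieces = [element.text or ""]
    pieces.extend(child.tail or "" for child in element)
    joined = " ".join(p.strip() for p in pieces if p.strip())
    return joined or None


def parse_rec(element):
    """Flatten the tree into a list of {name_class, text} dicts."""
    records = [{"name_class": element.get("class"), "text": own_text(element)}]
    for child in element:
        records.extend(parse_rec(child))
    return records


records = parse_rec(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

With the sample markup this should yield None for tag1, "yes" for tag2 and "no" for tag3. The same two functions should also work on elements parsed with lxml, since element.get(), .text, .tail and iteration over children behave the same way there.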