Extracting text with parent tag type from HTML using Python

I'm looking to extract each piece of text from some HTML, together with the type of the element that directly contains it. For example:

<div>
    some text
    <h1>some header</h1>
    some more text
</div>

Should give:

[{'tag':'div', 'text':'some text'}, {'tag':'h1', 'text':'some header'}, {'tag':'div', 'text':'some more text'}]

How can I parse through the HTML to extract this information?

I've tried using BeautifulSoup and am able to extract the information for one level in the HTML, like this:

from bs4 import BeautifulSoup

# html holds the markup shown above
soup = BeautifulSoup(html, features='html.parser')

for child in soup.findChildren(recursive=False):
    print(child.name)
    for c in child.contents:
        print(c.name)
        print(c.text)

Which gives the following output:

div
None
   text here

h1
some header
None
  more text here

Solution

Using lxml and a recursive function, you can walk the whole tree:

text = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>
'''

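import lxml.html

def display(item):
    # text and tail are None when absent, so fall back to an empty string
    text = (item.text or '').strip()
    tail = (item.tail or '').strip()

    print('item:', item)
    print('tag :', item.tag)
    print('text:', text)
    if tail:
        # tail is the text that follows this element's closing tag,
        # so it belongs to the parent element
        print('tail:', tail, '| parent:', item.getparent().tag)

    print('---')

    # iterating an element yields its children (getchildren() is deprecated)
    for child in item:
        display(child)

soup = lxml.html.fromstring(text)

display(soup)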

Which gives

item: <Element div at 0x7f2b0ed4b6d0>
tag : div
text: some text
---
item: <Element h1 at 0x7f2b0ed3cef0>
tag : h1
text: some header
tail: some more text | parent: div
---

lxml stores "some more text" as the tail of h1, meaning the text that follows its closing tag, but you can use getparent() to assign it to div.
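To make the text/tail model concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml, which is an assumption made only so the snippet runs without extra packages; lxml elements expose the same text and tail attributes. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question (it happens to be well-formed XML,
# which ElementTree requires; lxml.html is more forgiving).
markup = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>'''

div = ET.fromstring(markup)
h1 = div[0]  # first child of div

# text: what comes right after the element's opening tag
print(repr(div.text.strip()))  # 'some text'
print(repr(h1.text.strip()))   # 'some header'

# tail: what comes right after the element's closing tag
print(repr(h1.tail.strip()))   # 'some more text'
print(repr(div.tail))          # None, nothing follows </div>
```

The last line shows why calling .strip() directly on text or tail is risky: either attribute can be None.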

With a small modification, display() becomes a function that collects the results in a list:

text = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>
'''

import lxml.html

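results = []

def convert(item):
    # text and tail are None when absent, so fall back to an empty string
    text = (item.text or '').strip()
    if text:
        results.append({'tag': item.tag, 'text': text})

    # iterating an element yields its children (getchildren() is deprecated)
    for child in item:
        convert(child)

    # tail follows this element's closing tag, so it belongs to the parent;
    # adding it after the children keeps the results in document order
    tail = (item.tail or '').strip()
    if tail:
        results.append({'tag': item.getparent().tag, 'text': tail})

soup = lxml.html.fromstring(text)

convert(soup)

print(results)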

This gives the following results (the printed list is wrapped here for readability):

[
   {'tag': 'div', 'text': 'some text'}, 
   {'tag': 'h1', 'text': 'some header'}, 
   {'tag': 'div', 'text': 'some more text'}
]
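If you would rather avoid the module-level results list, one possible variant is a generator that receives the parent tag as an argument instead of calling getparent(). This is only a sketch: iter_text and parent_tag are hypothetical names, and the demo parses with the standard library's xml.etree.ElementTree (an assumption, chosen so the snippet runs without lxml). Elements returned by lxml.html.fromstring() should work the same way, since they also support iteration, tag, text and tail. Comment nodes are not handled here.

```python
import xml.etree.ElementTree as ET

def iter_text(item, parent_tag=None):
    """Yield {'tag': ..., 'text': ...} dicts in document order."""
    # text directly after the opening tag belongs to the element itself
    text = (item.text or '').strip()
    if text:
        yield {'tag': item.tag, 'text': text}

    # descend before handling the tail, so nested text keeps its position
    for child in item:
        yield from iter_text(child, item.tag)

    # text after the closing tag belongs to the parent element
    tail = (item.tail or '').strip()
    if tail and parent_tag is not None:
        yield {'tag': parent_tag, 'text': tail}

# Nested markup to check the ordering
root = ET.fromstring('<div>a<p>b<b>c</b>d</p>e</div>')
results = list(iter_text(root))
print(results)
```

For the nested markup above, the entries come out as div/a, p/b, b/c, p/d, div/e, which matches the order in which the text appears in the source.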