I'm looking to extract text and element type from some HTML. For example:
<div>
some text
<h1>some header</h1>
some more text
</div>
Should give:
[{'tag':'div', 'text':'some text'}, {'tag':'h1', 'text':'some header'}, {'tag':'div', 'text':'some more text'}]
How can I parse through the HTML to extract this information?
I've tried using BeautifulSoup and am able to extract the information for one level of the HTML, like this:
from bs4 import BeautifulSoup

# html holds the example markup shown above
soup = BeautifulSoup(html, features='html.parser')
for child in soup.findChildren(recursive=False):
    print(child.name)
    for c in child.contents:
        print(c.name)
        print(c.text)
Which gives the following output:
div
None
text here
h1
some header
None
more text here
Using lxml and recursion, I can do:
import lxml.html

text = '''<div>
some text
<h1>some header</h1>
some more text
</div>
'''

def display(item):
    print('item:', item)
    print('tag :', item.tag)
    # .text and .tail are None when there is no text, so fall back to ''
    print('text:', (item.text or '').strip())
    tail = (item.tail or '').strip()
    if tail:
        print('tail:', tail, '| parent:', item.getparent().tag)
    print('---')
    for child in item:  # iterating an element yields its children
        display(child)

soup = lxml.html.fromstring(text)
display(soup)
Which gives:
item: <Element div at 0x7f2b0ed4b6d0>
tag : div
text: some text
---
item: <Element h1 at 0x7f2b0ed3cef0>
tag : h1
text: some header
tail: some more text | parent: div
---
It treats "some more text" as the tail of h1, but you can use getparent() to assign it to div.
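The text/tail split can be seen in isolation with a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml, which only works here because this particular snippet happens to be well-formed XML. ElementTree follows the same text/tail model but has no getparent(), so the parent is taken from the loop.

```python
import xml.etree.ElementTree as ET

snippet = "<div>some text<h1>some header</h1>some more text</div>"
root = ET.fromstring(snippet)

# .text is the text before the first child element.
# .tail is the text that follows an element's closing tag.
print(root.tag, repr(root.text))
for child in root:
    print(child.tag, repr(child.text), repr(child.tail), '| parent:', root.tag)
```

So "some more text" is stored on h1 as its tail, even though it sits inside div.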
After a small modification to display(), so that it collects the values in a list instead of printing them:
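import lxml.html

text = '''<div>
some text
<h1>some header</h1>
some more text
</div>
'''

results = []

def convert(item):
    # .text and .tail are None when there is no text, so fall back to ''
    results.append({'tag': item.tag, 'text': (item.text or '').strip()})
    for child in item:  # iterating an element yields its children
        convert(child)
    # the tail comes after the element's own content in the document,
    # and it belongs to the parent, so add it after the children
    tail = (item.tail or '').strip()
    if tail:
        results.append({'tag': item.getparent().tag, 'text': tail})

soup = lxml.html.fromstring(text)
convert(soup)
print(results)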
it gives the results:
[
{'tag': 'div', 'text': 'some text'},
{'tag': 'h1', 'text': 'some header'},
{'tag': 'div', 'text': 'some more text'}
]
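If installing lxml is not an option, roughly the same result can be sketched with the standard library's html.parser. This is a different technique from the one above: instead of walking a tree, it keeps a stack of open tags and attributes each text chunk to the innermost one. TagTextCollector and VOID_TAGS are hypothetical names made up for this example, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never have a closing tag.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}

class TagTextCollector(HTMLParser):
    """Hypothetical helper: pairs each text chunk with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, innermost last
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.results.append({'tag': self.stack[-1], 'text': text})

html = '''<div>
some text
<h1>some header</h1>
some more text
</div>
'''

collector = TagTextCollector()
collector.feed(html)
collector.close()
print(collector.results)
```

Unlike the lxml version, this one skips elements that contain no text rather than adding an entry with an empty string.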