Extracting text with parent tag type from HTML using Python


I'm looking to extract text and element type from some HTML. For example:

<div>
    some text
    <h1>some header</h1>
    some more text
</div>

Should give:

[{'tag':'div', 'text':'some text'}, {'tag':'h1', 'text':'some header'}, {'tag':'div', 'text':'some more text'}]

How can I parse through the HTML to extract this information?

I've tried using BeautifulSoup and am able to extract the information for one level in the HTML, like this:

soup = BeautifulSoup(html, features='html.parser')

for child in soup.findChildren(recursive=False):
    print(child.name)
    for c in child.contents:
        print(c.name)
        print(c.text)

Which gives the following output:

div
None
   text here

h1
some header
None
  more text here

Solution

  • Using lxml and recursion, you can do this:

    text = '''<div>
        some text
        <h1>some header</h1>
        some more text
    </div>
    '''
    
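    def display(item):
        print('item:', item)
        print('tag :', item.tag)
        # text and tail are None when absent, so fall back to ''
        print('text:', (item.text or '').strip())
        tail = (item.tail or '').strip()
        if tail:
            print('tail:', tail, '| parent:', item.getparent().tag)
        
        print('---')
        
        # iterating over an element yields its children
        # (getchildren() is deprecated)
        for child in item:
            display(child)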
            
    import lxml.html
    
    soup = lxml.html.fromstring(text)
    
    display(soup)
    

    Which gives

    item: <Element div at 0x7f2b0ed4b6d0>
    tag : div
    text: some text
    ---
    item: <Element h1 at 0x7f2b0ed3cef0>
    tag : h1
    text: some header
    tail: some more text | parent: div
    ---
    

    lxml treats "some more text" as the tail of h1 (the text that follows its closing tag), but you can use getparent() to assign it to div.


    With a small modification, the function collects the results in a list:

    text = '''<div>
        some text
        <h1>some header</h1>
        some more text
    </div>
    '''
    
    import lxml.html
    
    results = []
    
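    def convert(item):
        # text and tail are None when absent, so fall back to ''
        text = (item.text or '').strip()
        if text:
            results.append({'tag': item.tag, 'text': text})
        
        # visit children before the tail so nested results stay in document order
        for child in item:
            convert(child)
        
        # the tail follows the closing tag, so it belongs to the parent
        tail = (item.tail or '').strip()
        if tail:
            results.append({'tag': item.getparent().tag, 'text': tail})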
            
    soup = lxml.html.fromstring(text)
    
    convert(soup)
    
    print(results)
    

    It gives the result:

    [
       {'tag': 'div', 'text': 'some text'}, 
       {'tag': 'h1', 'text': 'some header'}, 
       {'tag': 'div', 'text': 'some more text'}
    ]
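
If installing lxml is not an option, the same idea (attach each run of text to the tag that is open at that point) can be sketched with the standard library's html.parser. This is only a minimal sketch, not part of the answer above: the names TagTextCollector, VOID_TAGS and extract are made up for illustration, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class TagTextCollector(HTMLParser):
    """Hypothetical helper: records each text run with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.results = []    # collected {'tag': ..., 'text': ...} dicts

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text outside any element gets None as its tag.
            parent = self.stack[-1] if self.stack else None
            self.results.append({'tag': parent, 'text': text})


def extract(html):
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return parser.results


html = '''<div>
    some text
    <h1>some header</h1>
    some more text
</div>
'''

print(extract(html))
```

For the sample HTML this prints the same three dictionaries as the lxml version. Because the parser reports text in document order, nested elements need no special handling; the stack of open tags plays the role that getparent() plays in lxml.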