
Python regex to strip html a tags without href attribute


I have a string that has been cleaned with lxml's Cleaner, so all links are now in the form <a ...>Content</a>. Now I'd like to strip out all links that have no href attribute, e.g.

<a rel="nofollow">Link to be removed</a>

should become

Link to be removed

The same for:

<a>Other link to be removed</a>

Should become:

Other link to be removed

In short, every link with a missing href attribute should be unwrapped, keeping its content. It doesn't have to be regex, but since lxml returns a clean markup structure, a regex should be possible. What I need is a source string stripped of such non-functional a tags.


Solution

  • Use the drop_tag method:

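    import lxml.html

    # The parser closes the unclosed <div> automatically.
    root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>')
    # './/a' matches <a> elements at any depth, not only direct children of root.
    for a in root.xpath('.//a[not(@href)]'):
        a.drop_tag()

    # encoding='unicode' makes tostring() return str rather than bytes (Python 3).
    assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'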
    

    From the lxml.html documentation (http://lxml.de/lxmlhtml.html):

    .drop_tag(): Drops the tag, but keeps its children and text.