I have a string that has been cleaned with lxml's Cleaner, so all links are now in the form Content. Now I'd like to strip out all links that have no href attribute, e.g.
<a rel="nofollow">Link to be removed</a>
should become
Link to be removed
The same for:
<a>Other link to be removed</a>
Should become:
Other link to be removed
In short, every link with a missing href attribute should be unwrapped. It doesn't have to be a regex, but since lxml returns a clean markup structure, it should be possible. What I need is a source string stripped of such non-functional a tags.
Use the drop_tag method.
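import lxml.html

root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>')
# './/a' also matches links nested deeper than direct children of the root
for a in root.xpath('.//a[not(@href)]'):
    a.drop_tag()
# encoding='unicode' returns str; on Python 3 the default return type is bytes
assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'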
.drop_tag(): Drops the tag, but keeps its children and text.