Example:
    html = <a><b>Text</b>Text2</a>
BeautifulSoup code:
    [x.extract() for x in html.findAll('b')]
The result is:
    html = <a>Text2</a>
lxml code:
    [bad.getparent().remove(bad) for bad in html.xpath(".//b")]
The result is:
    html = <a></a>
because lxml treats "Text2" as the tail of <b></b>, so it is removed together with the element.
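To make the tail behaviour visible, here is a small sketch that parses the same markup with lxml.etree and prints where each piece of text is stored. The variable names are only for illustration.

```python
from lxml import etree

root = etree.fromstring("<a><b>Text</b>Text2</a>")
b = root.find("b")

# "Text" is the text inside <b>, "Text2" is stored as the tail of <b>,
# and <a> itself has no text of its own before its first child.
print(repr(b.text))     # 'Text'
print(repr(b.tail))     # 'Text2'
print(repr(root.text))  # None

# Removing <b> therefore takes "Text2" with it.
root.remove(b)
after_remove = etree.tostring(root, encoding="unicode")
print(after_remove)     # <a/>
```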
If we only need the joined text of the tags, we can use:
    for bad in raw.xpath(xpath_search):
        bad.text = ''
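A minimal sketch of that workaround, assuming the goal is the concatenated text of <a> without the contents of <b>. The sample markup, the ".//b" expression, and the use of itertext() to join the text are my assumptions, not part of the original snippet.

```python
from lxml import etree

raw = etree.fromstring("<a><b>Text</b>Text2</a>")
xpath_search = ".//b"  # assumed search expression

# Blank the text of every matched element; the tags and their tails stay in the tree.
for bad in raw.xpath(xpath_search):
    bad.text = ''

# itertext() walks text and tail strings in document order.
joined = "".join(raw.itertext())
print(joined)  # Text2
```

Note that the <b> element is still in the tree afterwards, which is why this only helps when the text is all that matters.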
But how can I remove the tags themselves without changing the remaining text, that is, without losing the tail?
Edit:
I did the following to save the tail text by moving it to the previous sibling or, if there is none, to the parent.
    def remove_keeping_tail(self, element):
        """Save the tail text and then delete the element"""
        self._preserve_tail_before_delete(element)
        element.getparent().remove(element)

    def _preserve_tail_before_delete(self, node):
        if node.tail:  # preserve the tail
            previous = node.getprevious()
            if previous is not None:  # if there is a previous sibling it will get the tail
                if previous.tail is None:
                    previous.tail = node.tail
                else:
                    previous.tail = previous.tail + node.tail
            else:  # the parent gets the tail as text
                parent = node.getparent()
                if parent.text is None:
                    parent.text = node.tail
                else:
                    parent.text = parent.text + node.tail
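For anyone who wants to try this outside a class, here is a standalone sketch of the same idea as a plain function, with a small usage example. The sample markup and the root-element check are my additions. As far as I know, lxml.etree.strip_elements() with with_tail=False gives the same result, so the sketch compares against it.

```python
from lxml import etree


def remove_keeping_tail(element):
    """Remove element from its parent, keeping its tail text in the tree."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # A previous sibling exists: append the tail to its tail.
            previous.tail = (previous.tail or "") + element.tail
        else:
            # No previous sibling: append the tail to the parent's text.
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


markup = "<a>Head<b>Text</b>Text2<c/>Text3</a>"

root = etree.fromstring(markup)
for bad in root.xpath(".//b"):
    remove_keeping_tail(bad)
result = etree.tostring(root, encoding="unicode")
print(result)  # <a>HeadText2<c/>Text3</a>

# Same operation with lxml's own helper, keeping the tail text.
other = etree.fromstring(markup)
etree.strip_elements(other, "b", with_tail=False)
builtin_result = etree.tostring(other, encoding="unicode")
```

If you parse with lxml.html instead of lxml.etree, the elements also have a drop_tree() method that removes the element but keeps its tail.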
HTH