python beautifulsoup html-parsing lxml

How to delete a tag from a node in lxml without deleting its tail?


Example:

html = <a><b>Text</b>Text2</a>

BeautifulSoup code:

[x.extract() for x in html.findAll('b')]

The result is:

html = <a>Text2</a>

lxml code:

[bad.getparent().remove(bad) for bad in html.xpath(".//b")]

The result is:

html = <a></a>

because lxml treats "Text2" as the tail of <b></b>, so it is removed together with the element.
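
The tail behaviour can be checked directly. The following is a minimal sketch, assuming lxml is installed and using the plain XML parser from lxml.etree (the variable names are only illustrative):

```python
from lxml import etree

root = etree.fromstring("<a><b>Text</b>Text2</a>")
b = root.find("b")

# lxml attaches the text that follows an element's closing tag
# to that element as .tail, not to the parent.
b_text = b.text        # text inside <b>
b_tail = b.tail        # text right after </b>
parent_text = root.text  # text in <a> before its first child

print(repr(b_text), repr(b_tail), repr(parent_text))

# Removing the element therefore takes the tail along with it.
root.remove(b)
result = etree.tostring(root, encoding="unicode")
print(result)
```

After the removal, "Text2" no longer appears in the serialized parent, which matches the empty <a> shown above.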

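If we only need the joined text content of the node, we can blank the text of the unwanted tags instead of removing them:

for bad in raw.xpath(xpath_search):
    bad.text = ''

But how can we remove the tags themselves without changing the surrounding text, that is, without losing their tails?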


Solution

  • Edit:

    Please look at @Joshmaker's answer, https://stackoverflow.com/a/47946748/8055036, which is clearly the better one.

    I did the following to save the tail text by appending it to the previous sibling or, if there is none, to the parent.

    def remove_keeping_tail(self, element):
        """Save the tail text and then delete the element."""
        self._preserve_tail_before_delete(element)
        element.getparent().remove(element)

    def _preserve_tail_before_delete(self, node):
        if node.tail:  # only act when there is tail text to preserve
            previous = node.getprevious()
            if previous is not None:  # a previous sibling exists, so it takes over the tail
                if previous.tail is None:
                    previous.tail = node.tail
                else:
                    previous.tail = previous.tail + node.tail
            else:  # no previous sibling, so the parent gets the tail as text
                parent = node.getparent()
                if parent.text is None:
                    parent.text = node.tail
                else:
                    parent.text = parent.text + node.tail
    

    HTH
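
For reference, here is a self-contained sketch of the same idea as a plain function rather than a method, run against the example from the question. It assumes lxml is installed; the ValueError for a root element is an addition not present in the code above. The last part uses lxml's own etree.strip_elements with with_tail=False, which may be the approach the linked answer has in mind.

```python
from lxml import etree


def remove_keeping_tail(element):
    """Remove element from its parent but keep the element's tail text."""
    parent = element.getparent()
    if parent is None:
        # Assumption: removing the root is treated as an error here.
        raise ValueError("cannot remove the root element")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # Append the tail to the previous sibling's tail.
            previous.tail = (previous.tail or "") + element.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


root = etree.fromstring("<a><b>Text</b>Text2</a>")
for bad in root.xpath(".//b"):
    remove_keeping_tail(bad)
result = etree.tostring(root, encoding="unicode")
print(result)

# Built-in alternative: strip_elements drops the matching elements and,
# with with_tail=False, leaves their tail text in place.
other = etree.fromstring("<a><b>Text</b>Text2</a>")
etree.strip_elements(other, "b", with_tail=False)
builtin_result = etree.tostring(other, encoding="unicode")
print(builtin_result)
```

Both variants should leave "Text2" inside <a>. The built-in call is shorter and works on a whole subtree by tag name, while the hand-written function is useful when the elements to remove come from an arbitrary XPath query.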