Tags: python, lxml, lxml.html

How to keep all html elements with selector but drop all others?


I would like to get an HTML string without certain elements. However, I only know in advance which elements to keep, not which ones to drop.

Let's say I just want to keep all p and a tags inside the div with class="A".

Input:

<div class="A">
  <p>Text1</p>
  <img src="A.jpg">
  <div class="sub1">
    <p>Subtext1</p>
  </div>
  <p>Text2</p>
  <a href="url">link text</a>
</div>
<div class="B">
  ContentDiv2
</div>

Expected output:

<div class="A">
  <p>Text1</p>
  <p>Text2</p>
  <a href="url">link text</a>
</div>

If I knew the selectors of all the other elements, I could just use lxml's drop_tree(). The problem is that I don't know ['img', 'div.sub1', 'div.B'] in advance.

Example with drop_tree():

import lxml.cssselect
import lxml.html

# html_str holds the input HTML shown above
tree = lxml.html.fromstring(html_str)

elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()

output = lxml.html.tostring(tree)

Solution

  • I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:

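    target = tree.xpath('//div[@class="A"]')[0]
    # Relative paths (.//) so that only p and a elements inside target are collected
    to_keep = target.xpath('.//p | .//a')
    for t in target.xpath('.//*'):
        if t not in to_keep:
            # I believe remove() is better here than drop_tree(): it also discards
            # the element's tail text. Going through the parent handles nested elements.
            t.getparent().remove(t)
    print(lxml.html.tostring(target).decode())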
    

    The output I get is your expected output.