python html python-3.x beautifulsoup html-parsing

BeautifulSoup find_all when a tag is not inside another tag

html = """
<html>
   <h2>Top Single Name</h2>
   <table>
      <tr>
         <p>hello</p>
      </tr>
   </table>
   <div>
      <div>
         <h2>Price Return</h2>
      </div>
   </div>
</html>
"""

When I use the code below

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
soup.find_all(['p', 'li', 'dl', 'tr', 'div', re.compile("^h[1-6]$")])

I get this output

[<h2>Top Single Name</h2>,
 <tr><p>hello</p></tr>,
 <p>hello</p>,
 <div>
 <div>
 <h2>Price Return</h2>
 </div>
 </div>,
 <div>
 <h2>Price Return</h2>
 </div>,
 <h2>Price Return</h2>]

But what I need is only the three elements below

[<h2>Top Single Name</h2>,
<tr><p>hello</p></tr>,
<div>
 <div>
 <h2>Price Return</h2>
 </div>
 </div>
]

Basically, I don't want to extract a specific tag if it is inside another tag. Is there any way I can define a mapping like the one below and use it in the code, so that a tag is not extracted when the key is inside the value?

{'re.compile("^h[1-6]$")': 'div', 'div':'div', 'p': 'tr'}

Solution

Basically I don't want to extract a specific tag if it is inside another tag

I think the simplest way might be to use find_all just as you are now, and then filter out the nested tags by checking whether any of their ancestors/parents are also in the list:

sel = soup.find_all(['p', 'li', 'dl', 'tr', 'div', re.compile("^h[1-6]$")])
sel = [s for s in sel if not [p for p in sel if p in s.parents]]

This gives the same results as selecting tags whose name is in a list, as long as none of their parents has one of the listed names:

selTags = ['p', 'li', 'dl', 'tr', 'div'] + [f'h{i}' for i in range(1,7)]
sel = soup.find_all(lambda t: t.name in selTags and not t.find_parent(selTags))
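One detail worth knowing about the list-comprehension filter: `p in s.parents` compares tags with `==`, and BeautifulSoup treats two tags as equal when their name, attributes and contents match, even if they are different elements in the tree. The following is a minimal sketch of the same ancestor filter using object identity instead. The names `selected_ids` and `top_level` are made up for illustration, and the HTML is the sample from the question.

```python
import re
from bs4 import BeautifulSoup

html = """
<html>
   <h2>Top Single Name</h2>
   <table>
      <tr>
         <p>hello</p>
      </tr>
   </table>
   <div>
      <div>
         <h2>Price Return</h2>
      </div>
   </div>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
sel = soup.find_all(["p", "li", "dl", "tr", "div", re.compile("^h[1-6]$")])

# Record object identities so that two distinct tags with identical
# markup are never mistaken for one another.
selected_ids = {id(tag) for tag in sel}

# Keep a tag only if none of its ancestors was itself selected.
top_level = [
    tag for tag in sel
    if not any(id(parent) in selected_ids for parent in tag.parents)
]

print([tag.name for tag in top_level])  # ['h2', 'tr', 'div']
```

For the question's HTML this should select the same three elements as the one-liner; the difference only matters when the document contains repeated, identical fragments.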

But if you want to filter by a map

is there any way i can have some mapping like below and use in the code don't extract when the key is inside value

you could use

parentMap = {'div':'div', 'p': 'tr'}
for i in range(1,7): parentMap[f'h{i}'] = 'div'
# parentMap = {'div': 'div', 'p': 'tr', 'h1': 'div', 'h2': 'div', 'h3': 'div', 'h4': 'div', 'h5': 'div', 'h6': 'div'}

selTags = ['p', 'li', 'dl', 'tr', 'div'] + [f'h{i}' for i in range(1,7)]
sel = soup.find_all(
    lambda t: t.name in selTags
    and not (
        t.name in parentMap and
        t.find_parent(parentMap[t.name]) is not None
    )
)
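The mapping in the question used a compiled regex as a key. If you would rather keep that shape than expand h1 to h6 by hand, one possible sketch is a list of (pattern, blocking parent) pairs checked inside the function passed to find_all. `RULES` and `keep` are hypothetical names, not part of BeautifulSoup, and the shortened HTML string is only there to make the example self-contained.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical rule table: (pattern for the tag name, parent name that blocks it).
# A blocking parent of None means the tag is always kept.
RULES = [
    (re.compile(r"^h[1-6]$"), "div"),
    (re.compile(r"^div$"), "div"),
    (re.compile(r"^p$"), "tr"),
    (re.compile(r"^(li|dl|tr)$"), None),
]

def keep(tag):
    """Return True if the tag matches a rule and is not inside its blocking parent."""
    for pattern, blocking_parent in RULES:
        if pattern.match(tag.name):
            return blocking_parent is None or tag.find_parent(blocking_parent) is None
    return False  # tag name is not one we select at all

html = (
    "<h2>Top Single Name</h2>"
    "<table><tr><p>hello</p></tr></table>"
    "<div><div><h2>Price Return</h2></div></div>"
)
soup = BeautifulSoup(html, "html.parser")
sel = soup.find_all(keep)

print([t.name for t in sel])  # ['h2', 'tr', 'div']
```

The first matching rule wins, so the order of `RULES` only matters if two patterns can match the same tag name.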

With the HTML in the question, you should get the same results either way, but if your HTML contained

<p><tr>I am a row in a paragraph</tr></p>

then the first two methods will return only the outer <p> tag, whereas the last method will return both the <p> tag and the inner <tr> tag [unless you add 'tr': 'p' to parentMap].
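To see that difference for yourself, here is a small sketch that runs the ancestor-based filter and the map-based filter on that one-line snippet. The variable names `by_ancestor` and `by_map` are made up for the example, and it assumes the same 'html.parser' parser used in the question.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><tr>I am a row in a paragraph</tr></p>", "html.parser")

sel_tags = ["p", "li", "dl", "tr", "div"] + [f"h{i}" for i in range(1, 7)]
parent_map = {"div": "div", "p": "tr", **{f"h{i}": "div" for i in range(1, 7)}}

# Ancestor-based: any selected ancestor hides the tag.
by_ancestor = soup.find_all(
    lambda t: t.name in sel_tags and not t.find_parent(sel_tags)
)

# Map-based: only the specific parent named in parent_map hides the tag.
by_map = soup.find_all(
    lambda t: t.name in sel_tags
    and not (t.name in parent_map and t.find_parent(parent_map[t.name]) is not None)
)

print([t.name for t in by_ancestor])  # ['p']
print([t.name for t in by_map])       # ['p', 'tr']
```

Adding 'tr': 'p' to `parent_map` should make the second result match the first.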