python html python-3.x beautifulsoup html-parsing

BeautifulSoup find_all when a tag is not inside another tag


html = """
<html>
   <h2>Top Single Name</h2>
   <table>
      <tr>
         <p>hello</p>
      </tr>
   </table>
   <div>
      <div>
         <h2>Price Return</h2>
      </div>
   </div>
</html>
"""

When I use the code below:

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
soup.find_all(['p', 'li', 'dl', 'tr', 'div', re.compile("^h[1-6]$")])

I get this output:

[<h2>Top Single Name</h2>,
 <tr><p>hello</p></tr>,
 <p>hello</p>,
 <div>
 <div>
 <h2>Price Return</h2>
 </div>
 </div>,
 <div>
 <h2>Price Return</h2>
 </div>,
 <h2>Price Return</h2>]

But what I need is only the three elements below:

[<h2>Top Single Name</h2>,
<tr><p>hello</p></tr>,
<div>
 <div>
 <h2>Price Return</h2>
 </div>
 </div>
]

Basically, I don't want to extract a specific tag if it is inside another tag. Is there any way I can define a mapping like the one below and use it in the code, so that a tag is not extracted when the key is inside the value?

{'re.compile("^h[1-6]$")': 'div', 'div':'div', 'p': 'tr'}

Solution

  • Basically I don't want to extract a specific tag if it is inside another tag

    I think the simplest way is to use find_all just as you are now, and then filter out the nested tags by checking whether any of their ancestors/parents are also in the result list:

    sel = soup.find_all(['p', 'li', 'dl', 'tr', 'div', re.compile("^h[1-6]$")])
    sel = [s for s in sel if not [p for p in sel if p in s.parents]]
    

    This gives the same results as selecting tags whose name is in a list, as long as none of their parents has one of the listed names:

    selTags = ['p', 'li', 'dl', 'tr', 'div'] + [f'h{i}' for i in range(1,7)]
    sel = soup.find_all(lambda t: t.name in selTags and not t.find_parent(selTags))
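
To check the two snippets above end to end, here is a minimal, self-contained sketch that runs both on the HTML from the question. The variable names (`matches`, `selected_ids`, `outer_by_filter`, `outer_by_parent`) are mine, and the first approach uses an identity check via `id()` in place of the `p in s.parents` membership test; that substitution is an assumption on my part, not part of the answer, and is only meant to avoid comparing tags by content.

```python
import re
from bs4 import BeautifulSoup

html = """
<html>
   <h2>Top Single Name</h2>
   <table>
      <tr>
         <p>hello</p>
      </tr>
   </table>
   <div>
      <div>
         <h2>Price Return</h2>
      </div>
   </div>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: select everything, then drop any tag that has a selected ancestor.
matches = soup.find_all(["p", "li", "dl", "tr", "div", re.compile("^h[1-6]$")])
selected_ids = {id(tag) for tag in matches}  # identity check (my substitution)
outer_by_filter = [
    tag for tag in matches
    if not any(id(parent) in selected_ids for parent in tag.parents)
]

# Approach 2: let find_all reject tags that sit inside any listed tag.
sel_tags = ["p", "li", "dl", "tr", "div"] + [f"h{i}" for i in range(1, 7)]
outer_by_parent = soup.find_all(
    lambda t: t.name in sel_tags and t.find_parent(sel_tags) is None
)

print([t.name for t in outer_by_filter])   # ['h2', 'tr', 'div']
print([t.name for t in outer_by_parent])   # ['h2', 'tr', 'div']
```

Both lists hold the first `<h2>`, the `<tr>`, and the outer `<div>`, which are the three elements asked for.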
    

    But if you want to filter by a map, as in this part of your question:

    Is there any way I can define a mapping like the one below and use it in the code, so that a tag is not extracted when the key is inside the value?

    then you could use:

    parentMap = {'div':'div', 'p': 'tr'}
    for i in range(1,7): parentMap[f'h{i}'] = 'div'
    # parentMap = {'div': 'div', 'p': 'tr', 'h1': 'div', 'h2': 'div', 'h3': 'div', 'h4': 'div', 'h5': 'div', 'h6': 'div'}
    
    sel = soup.find_all(
        lambda t: t.name in 
        ['p', 'li', 'dl', 'tr', 'div']+[f'h{i}' for i in range(1,7)]
        and not (
            t.name in parentMap and 
            t.find_parent(parentMap[t.name]) is not None
        )
    )
    


    In this case, you should get the same results either way, but if your HTML contained

    <p><tr>I am a row in a paragraph</tr></p>
    

    then the first two methods will return only the outer <p> tag, whereas the last method will return both the <p> tag and the inner <tr> tag (unless you add 'tr': 'p' to parentMap).
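
The difference described above can be reproduced with a short sketch. The helper name `find_unless_inside` and its parameters are hypothetical (they are not part of BeautifulSoup); it simply wraps the map-based lambda from the answer so that the same rule can be run with and without the extra `'tr': 'p'` entry.

```python
from bs4 import BeautifulSoup

snippet = "<p><tr>I am a row in a paragraph</tr></p>"
soup = BeautifulSoup(snippet, "html.parser")

headings = [f"h{i}" for i in range(1, 7)]
sel_tags = ["p", "li", "dl", "tr", "div"] + headings


def find_unless_inside(soup, names, parent_map):
    """Hypothetical helper: return tags named in `names`, skipping a tag
    when parent_map[tag.name] names one of its ancestors."""
    wanted = set(names)

    def keep(tag):
        if tag.name not in wanted:
            return False
        blocker = parent_map.get(tag.name)
        return blocker is None or tag.find_parent(blocker) is None

    return soup.find_all(keep)


# Ancestor-based rule: a tag is dropped if ANY listed tag encloses it.
outer_only = soup.find_all(
    lambda t: t.name in sel_tags and t.find_parent(sel_tags) is None
)

# Map-based rule: a tag is dropped only if its mapped ancestor encloses it.
parent_map = {"div": "div", "p": "tr", **{h: "div" for h in headings}}
with_map = find_unless_inside(soup, sel_tags, parent_map)
with_extra_rule = find_unless_inside(soup, sel_tags, {**parent_map, "tr": "p"})

print([t.name for t in outer_only])       # ['p']
print([t.name for t in with_map])         # ['p', 'tr']
print([t.name for t in with_extra_rule])  # ['p']
```

The map gives finer control, at the cost of having to list every parent/child pair you care about.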