python, python-3.x, web-scraping, beautifulsoup, html-parsing

Beautiful Soup: parse multiple tags with different attributes


sentences.find_all(['p','h2'],attrs={['class':None,'class':Not None]}).

This syntax is invalid, but is there an alternative? I want p tags with one attribute value and h2 tags with another, and I need them in document order from a single search rather than from two separate searches of the parse tree, i.e. I don't want to do:

  1. sentences.find_all('p', attrs={'class': None})
  2. sentences.find_all('h2', attrs={'class': True})

Solution

  • You can use a CSS selector list, with the individual selectors separated by a comma (see the CSS reference):

    from bs4 import BeautifulSoup
    
    html_doc = """
    <p class="cls1">Select this</p>
    <p class="cls2">Don't select this</p>
    <h2 class="cls3">Select this</h2>
    <h2 class="cls4">Don't select this</h2>
    """
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    for tag in soup.select("p.cls1, h2.cls3"):
        print(tag)
    

    Prints:

    <p class="cls1">Select this</p>
    <h2 class="cls3">Select this</h2>
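
A small sketch of why this addresses the "sequentially" requirement: a single select() call with a comma-separated selector returns the matches in the order they appear in the document, not grouped by selector. The interleaved markup below is made up for illustration and is not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the two kinds of tags interleaved.
html_doc = """
<h2 class="cls3">First heading</h2>
<p class="cls1">First paragraph</p>
<h2 class="cls3">Second heading</h2>
<p class="cls1">Second paragraph</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One select() call: results come back interleaved, in document order.
texts = [tag.get_text() for tag in soup.select("p.cls1, h2.cls3")]
print(texts)
```

This should print the headings and paragraphs alternating, in the same order as in the markup.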
    

    EDIT: To select multiple tags where one of them must have no attributes:

    from bs4 import BeautifulSoup
    
    html_doc = """
    <p>Select this</p>
    <p class="cls2">Don't select this</p>
    <h2 class="cls3">Select this</h2>
    <h2 class="cls4">Don't select this</h2>
    """
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    for tag in soup.select("p, h2.cls3"):
        # Skip any <p> that carries attributes; keep only bare <p> tags.
        if tag.name == "p" and tag.attrs:
            continue
        print(tag)
    

    Prints:

    <p>Select this</p>
    <h2 class="cls3">Select this</h2>
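
As a possible alternative that matches the question more literally (p without a class, h2 with some class), the check can be moved into the selector itself or into a filter function passed to find_all(). This is only a sketch: the markup is adapted from the example above (the last h2 has had its class removed for illustration), the helper name wanted is hypothetical, and unlike the loop above it only tests the class attribute rather than all attributes.

```python
from bs4 import BeautifulSoup

# Adapted markup (assumption): the last <h2> has no class here.
html_doc = """
<p>Select this</p>
<p class="cls2">Don't select this</p>
<h2 class="cls3">Select this</h2>
<h2>Don't select this</h2>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# CSS route: <p> without a class attribute, <h2> with any class attribute.
css_matches = soup.select("p:not([class]), h2[class]")


# find_all() route: the filter function is called with each tag
# and returns True for the tags to keep.
def wanted(tag):  # hypothetical helper name
    if tag.name == "p":
        return not tag.has_attr("class")
    if tag.name == "h2":
        return tag.has_attr("class")
    return False


func_matches = soup.find_all(wanted)

for tag in css_matches:
    print(tag)
```

Both routes walk the tree once, so the results stay in document order, and both should yield the same two tags here.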