Python BeautifulSoup find_all() method returns unnecessary elements


I'm having trouble scraping elements with the find_all() method.

I am looking for the <li class='list-row'>.....</li> tags, but the result also includes <li class='list-row reach-list'> tags, which carry additional classes.

I tried with the select() method too.

Here's the Python code:

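from bs4 import BeautifulSoup

# Read the saved page from disk
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")
# Locate the outer list, then every <li> carrying the list-row class
main_block = soup.find('ul', class_='list')
for li in main_block.find_all('li', class_='list-row'):
    print(li.prettify())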

Here's the HTML file, index.html:

<ul class="list">
 <li class="list-row">
  <h2>
   <a href="/praca/emis/O4533184" id="offer4533184">
    <span class="title">
     Senior Developer (HTML, React, VUE.js, C#, SQL)
    </span>
   </a>
  </h2>
 </li>
 <li class="list-row reach-list">
  <ul class="list">
    <span class="employer">
     IT lions consulting a.s.
    </span>
   </li>
  </ul>
 </li>
</ul>

Solution

  • You can specify that you only want <li> tags that contain an <h2> element, for example:

    from bs4 import BeautifulSoup
    
    html_doc = '''\
    <ul class="list">
     <li class="list-row">
      <h2>
       <a href="/praca/emis/O4533184" id="offer4533184">
        <span class="title">
         Senior Developer (HTML, React, VUE.js, C#, SQL)
        </span>
       </a>
      </h2>
     </li>
     <li class="list-row reach-list">
      <ul class="list">
        <span class="employer">
         IT lions consulting a.s.
        </span>
       </li>
      </ul>
     </li>
    </ul>'''
    
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    for li in soup.select('.list-row:has(h2)'):
        print(li)
    

    Prints:

    <li class="list-row">
    <h2>
    <a href="/praca/emis/O4533184" id="offer4533184">
    <span class="title">
         Senior Developer (HTML, React, VUE.js, C#, SQL)
        </span>
    </a>
    </h2>
    </li>
    

    Or, to select only <li> elements that contain a title: '.list-row:has(.title)'
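
The extra matches happen because class_='list-row' matches any tag that has list-row among its classes, not only tags whose class attribute is exactly list-row. As a minimal sketch of two other ways to narrow the result (the markup below is a simplified stand-in for the question's page, and the names loose, exact and excluded are only illustrative), you could compare the class list exactly or exclude the unwanted class with :not():

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's markup (an assumption, not the real page)
html_doc = '''
<ul class="list">
 <li class="list-row"><h2><span class="title">Senior Developer</span></h2></li>
 <li class="list-row reach-list"><span class="employer">IT lions consulting a.s.</span></li>
</ul>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# class_='list-row' matches any tag that has list-row among its classes
loose = soup.find_all('li', class_='list-row')

# Keep only <li> tags whose class attribute is exactly ['list-row']
exact = soup.find_all(
    lambda tag: tag.name == 'li' and tag.get('class') == ['list-row']
)

# Or exclude the unwanted class with a CSS selector
excluded = soup.select('li.list-row:not(.reach-list)')

print(len(loose), len(exact), len(excluded))
```

The exact comparison breaks as soon as the wanted rows gain another class, so the :not() or :has() selectors are probably the more robust choice if the page's markup changes.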