Tags: python, web-scraping, beautifulsoup, web-crawler, html-parsing

Grabbing the main text content of an HTML tag without the <span> inside


I'm building a Python web scraper that goes through an eBay search results page (in this case, 'Gaming laptops') and grabs the title of each item for sale. I'm using BeautifulSoup to first grab the h1 tags where the titles are stored, then print each one out as text:

    for item_name in soup.find_all('h1', {'class': 'it-ttl'}):
        print(item_name.text)

However, within each h1 tag with the class of 'it-ttl', there is also a span tag that contains some text:

    <h1 class="it-ttl" itemprop="name" id="itemTitle">
     <span class="g-hdn">Details about  &nbsp;</span>
     Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
    </h1>

My current program prints both the contents of the span tag and the item title.

Could someone explain how to grab just the item title while ignoring the span tag that contains the "Details about" text? Thanks!


Solution

  • It can be done by removing the offending <span> from the parsed tree with decompose() before reading the text:

    from bs4 import BeautifulSoup as bs

    item = """
    <h1 class="it-ttl" itemprop="name" id="itemTitle">
     <span class="g-hdn">Details about  &nbsp;</span>
     Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
    </h1>
    """
    soup = bs(item, 'lxml')  # the 'lxml' parser requires the lxml package
    target = soup.select_one('h1')
    # Remove the <span> from the tree so its text no longer appears in target.text
    target.select_one('span').decompose()
    print(target.text.strip())
    

    Output:

    Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
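
    Since decompose() modifies the parsed tree, a non-destructive alternative may be useful when the <span> is still needed later. The sketch below is not part of the original answer: it reads only the text nodes that are direct children of each h1 and applies that to every matching tag, as in the question's loop. The helper name own_text, the trimmed-down sample markup, and the second "Example Brand" title are made up for illustration, and the built-in 'html.parser' is used here only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question's <h1>;
# the second title is invented for illustration.
html = """
<h1 class="it-ttl">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
<h1 class="it-ttl">
 <span class="g-hdn">Details about  &nbsp;</span>
 Example Brand 17.3" Gaming Laptop
</h1>
"""


def own_text(tag):
    """Return the text held directly by `tag`, skipping text in child tags."""
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so the text inside <span> is not included.
    parts = tag.find_all(string=True, recursive=False)
    return "".join(parts).strip()


soup = BeautifulSoup(html, "html.parser")
titles = [own_text(h1) for h1 in soup.find_all("h1", class_="it-ttl")]

for title in titles:
    print(title)
```

    Unlike the decompose() version, this leaves each <span> in place, so item_name.text would still return the full text afterwards if it is needed.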