Tags: web-scraping, beautifulsoup, html-parsing

Select class name in HTML Parser containing extra words


I am trying to scrape reviews from a web page. Each review falls into one of three categories: positive, neutral, or negative. I am using BeautifulSoup with the HTML parser and can already reach most tags, but the class of the review div changes with the category. How can I select all three variants?

<div class="review positive" title="" style="background-color: #00B551;">9.3</div>
<div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
<div class="review neutral" title="" style="background-color: #FFFF00;">6</div>

I already collect one container div per item in Python:

# finds each product from the store page
containers = page_soup.findAll("div", {"class": "item-container"})

for container in containers:
    title = container.find("a").text  # this gives me the title
    # Similarly, I need the review of each item here. Besides "review", the class
    # also contains the word positive, neutral, or negative, depending on the review type.
    review = container.findAll("div", {"class": "review "})

Solution

  • Using a regex, you can match every div whose class contains the substring "review".

    import re
    
    for container in containers:
        title = container.find("a").text  # this gives me the title
    
        review = container.findAll("div", {"class": re.compile(r'review')})
    

    See the difference:

    html = '''<div class="review positive" title="" style="background-color: #00B551;">9.3</div>
    <div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
    <div class="review neutral" title="" style="background-color: #FFFF00;">6</div>'''
    
    from bs4 import BeautifulSoup
    import re
    
    soup = BeautifulSoup(html, 'html.parser')
    review = soup.find_all('div', {'class':'review '})
    print ('No regex: ',review)
    
    print('\n')
    
    review = soup.findAll("div", {"class": re.compile(r'review')})
    print ('Regex: ',review)
    

    Output:

    No regex:  []
    
    
    Regex:  [<div class="review positive" style="background-color: #00B551;" title="">9.3</div>, <div class="review negative" style="background-color: #FF0000;" title="">4.8</div>, <div class="review neutral" style="background-color: #FFFF00;" title="">6</div>]
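
A possible alternative that avoids regex: BeautifulSoup treats class as a multi-valued attribute, so searching for the single class "review" (without the trailing space) matches any tag that has review among its classes. This also sidesteps the fact that a regex such as re.compile(r'review') would match any class that merely contains that substring. Below is a minimal sketch, assuming each item-container holds an <a> with the title and exactly one review div. The wrapper markup and the product names are made up for illustration, and the names results, category, and score are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the wrapper divs and <a> titles are assumptions,
# only the review divs come from the question.
html = '''<div class="item-container">
  <a href="#">Product A</a>
  <div class="review positive" title="" style="background-color: #00B551;">9.3</div>
</div>
<div class="item-container">
  <a href="#">Product B</a>
  <div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
</div>
<div class="item-container">
  <a href="#">Product C</a>
  <div class="review neutral" title="" style="background-color: #FFFF00;">6</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

results = []
for container in soup.find_all("div", class_="item-container"):
    title = container.find("a").get_text(strip=True)
    # class_="review" matches any div whose class list includes "review"
    review = container.find("div", class_="review")
    # review["class"] is a list such as ["review", "positive"];
    # the token that is not "review" is taken as the category
    category = next((c for c in review["class"] if c != "review"), None)
    score = float(review.get_text(strip=True))
    results.append((title, category, score))

print(results)
```

If a container might have no review div, container.find returns None, so a check for that would be needed before reading review["class"].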