python web-scraping beautifulsoup h2 findall

How to Extract Multiple H2 Tags Using BeautifulSoup


import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')
#print(articles)

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  article = {
    'H2_Heading': h2_headings,
  }

  print('Added article:', article)
  articlelist.append(article)

df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

The webpage used within the script has multiple H2 Heading tags that I want to scrape.

I'm looking for a way to simply scrape all the H2 Heading text as shown below:

ANGRY BIRDS 2, ANGRY BIRDS DREAM BLAST, ANGRY BIRDS FRIENDS, ANGRY BIRDS MATCH, ANGRY BIRDS BLAST, ANGRY BIRDS POP

Issue

When I use h2_headings = item.find('h2').text, it extracts the text of the first h2 heading, as expected.

However, I need to capture every instance of the H2 tag. When I use h2_headings = item.find_all('h2'), it returns the following result:

{'H2_Heading': [<h2>Angry Birds 2</h2>, <h2>Angry Birds Dream Blast</h2>, <h2>Angry Birds Friends</h2>, <h2>Angry Birds Match</h2>, <h2>Angry Birds Blast</h2>, <h2>Angry Birds POP</h2>]}

Amending the statement to h2_headings = item.find_all('h2').text.strip() returns the following error:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
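
The behaviour can be reproduced without the network. The sketch below uses a small inline HTML snippet as a stand-in for the page (the markup is an assumption, not the real page source) and the built-in html.parser so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption, not the real page source).
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <h2>Angry Birds Dream Blast</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="post-body__container")

first = container.find("h2")      # a single Tag, so .text works
every = container.find_all("h2")  # a ResultSet: a list-like collection of Tags

first_text = first.text
print(first_text)

try:
    every.text  # the collection itself has no .text attribute
    raised = False
except AttributeError as exc:
    raised = True
    print("AttributeError:", exc)
```

So find() gives one element with a .text attribute, while find_all() gives a collection whose elements each have one.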

Any help would be greatly appreciated.


Solution

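  • find_all() returns a ResultSet, a list-like collection of Tag objects, so .text is not available on the result as a whole. Call get_text() on each element and join the results: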

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    articlelist = []
    url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'
    
    r = requests.get(url)
    #print(r.status_code)
    
    soup = BeautifulSoup(r.content, features='lxml')
    articles = soup.find_all('div', class_ = 'post-body__container')
    
    
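    for item in articles:
        # Take the text of each <h2> individually, strip surrounding
        # whitespace, and join everything into one comma-separated string.
        h2 = ', '.join(x.get_text().strip() for x in item.find_all('h2'))
        print(h2)

        # Store the joined text (not the Tag objects) for the DataFrame.
        articlelist.append({'H2_Heading': h2})

    df = pd.DataFrame(articlelist)
    #df.to_csv('articlelist.csv', index=False)
    #print('Saved to csv')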
    

    Output:

    Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP
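
    The same idea can be wrapped in a small helper and tried offline. This is a minimal sketch, not part of the original answer: extract_h2_text is a hypothetical name, the sample markup is made up, and html.parser is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def extract_h2_text(html, container_class="post-body__container"):
    """Return the stripped text of every <h2> inside matching <div> containers.

    Hypothetical helper for illustration; the default class name is the one
    used in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    headings = []
    for container in soup.find_all("div", class_=container_class):
        for h2 in container.find_all("h2"):
            headings.append(h2.get_text().strip())
    return headings


# Made-up sample markup standing in for the real page.
sample_html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <p>Some event text.</p>
  <h2> Angry Birds Friends </h2>
  <h2>Angry Birds POP</h2>
</div>
"""

headings = extract_h2_text(sample_html)
joined = ", ".join(headings)
print(joined)
# The question shows the headings in upper case; str.upper() gives that form.
print(joined.upper())
```

    Keeping the headings as a list also makes it possible to store one row per heading, for example with pd.DataFrame({'H2_Heading': headings}), rather than a single joined string.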