Search code examples
pythonbeautifulsoupwikipedia

Scraping links from the Abstract, and Early Life sections separately


I want to scrape all the links from the Abstract and Early Life section of this page. https://en.wikipedia.org/wiki/Barack_Obama I want to store the links I get from the two sections separately. However, I am having issues with isolating that tag/class. I tried to use class "mw-headline" for the early life section but that is returning only the header text. Any hints are really appreciated.

I couldn't figure out how to the get abstract and early life sections separately.

url='https://en.wikipedia.org/wiki/Barack_Obama'
response = requests.get(url)
soup=bs(response.content,'html.parser')
page=soup.find('div',attrs={'id':'bodyContent'})

early_life=page.findAll('span',attrs={'class':'mw-headline'})

Solution

  • It's not very clear what format you want your output to be in, but the following will produce a list of dictionaries with the sections in several different formats:

    First, the sections and abstract are all inside this div, and are not nested any further into separate elements - so this starts by selecting the whole outer element and then going through its children:

    content = soup.select_one('#mw-content-text > .mw-parser-output').children
    splitContent = []
    

    (splitContent is the list that will be filled up with a dictionary for each section.)

    for c in content:
        if c.name == 'h2' or splitContent == []:
            sectionName = 'Abstract' if splitContent == [] else c.text
            splitContent.append({
                'section': sectionName, 
                'listSoups': [], 'HTML': '', 'asText': ''
            }) 
    
        splitContent[-1]['listSoups'].append(c)
        splitContent[-1]['HTML'] += str(c)
        if c.name not in ['style', 'script']: 
            splitContent[-1]['asText'] += c.text
    

    Each section header is wrapped as h2*, so every time the loop gets to a child tag that's h2, a new dictionary is started, and the child object itself is always added to listSoups in the last dictionary of the splitContent list.

    HTML is saved too, so if you want a single bs4 object to be created for each section, splitContent can be looped through:

    for i in range(len(splitContent)):
        splitContent[i]['asSoup'] = BeautifulSoup(splitContent[i]['HTML'], 'html.parser')
    

    Now, you can see any of the sections in any of the formats added to the dictionaries.

    Note that listSoups is not the same as asSoup. listSoups is a list, and each item within is still connected to the original soup variable and you can view its parent, nextSibling, etc in ways that would not be possible with asSoup, which is a single object.


    *Btw, using {'class':'mw-headline'} will give you not just the main section headers, but also the subheaders. You can actually get something like a tree of the article structure with:

    for h in soup.findAll('span',attrs={'class':'mw-headline'}):
        hLevel = int(h.parent.name.replace('h', ''))
        print(('\t'*(hLevel-2))+'↳', f'[{h.parent.name}] {h.text}')
    

    Additional EDIT:

    To get a dictionary of section texts, just use

    sectnTexts_dict = dict([(
        sc['section'].replace(' ', '_'), # section name to key
        sc['asText'] # section text as value
    ) for sc in splitContent])
    

    to view a truncated version, print dict((k, v[:50]+'...') for k, v in sectnTexts_dict.items()), which looks like

    {
       "Abstract": "44th President of the United States\n\"Barack\" and \"...",
       "Early_life_and_career": "Early life and career\nMain article: Early life and...",
       "Legal_career": "Legal career\nCivil Rights attorney\nHe joined Davis...",
       "Legislative_career": "Legislative career\nIllinois Senate (1997\u20132004)\nMai...",
       "Presidential_campaigns": "Presidential campaigns\n2008\nMain articles: 2008 Un...",
       "Presidency_(2009\u20132017)": "Presidency (2009\u20132017)\n First official portrait of...",
       "Cultural_and_political_image": "Cultural and political image\nMain article: Public ...",
       "Post-presidency_(2017\u2013present)": "Post-presidency (2017\u2013present)\n Obama with his the...",
       "Legacy": "Legacy\n Job growth during the presidency of Obama ...",
       "Bibliography": "Bibliography\nMain article: Bibliography of Barack ...",
       "See_also": "See also\n\n\nBiography portal\nUnited States portal\nC...",
       "References": "References\n\n^ \"Barack Hussein Obama Takes The Oath...",
       "Further_reading": "Further reading\n\nDe Zutter, Hank (December 8, 1995...",
       "External_links": "External links\nLibrary resources about  Barack Oba..."
    }