Search code examples
python-2.7web-scrapingbeautifulsoupurllib2web-mining

Why can't I extract the subheading of a page using BeautifulSoup?


I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but it's unsuccessful for the subheading. Using inspect element in Chrome, I identified that the subheading text "Canada Census, 1901" is embedded as follows:

<div class="person-info">
    <div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
    <div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>

So I coded my script as follows:

import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time

def get_FamSearch():

    link = "https://example.org/pal:/MM9.1.1/KH11-999"
    openLink = urllib2.urlopen(link)
    Soup_FamSearch = BeautifulSoup(openLink, "html")
    openLink.close()

    NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
    if NameParentTag:
        Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
        name_decode = Name.encode("ascii", "ignore")
        print name_decode

    SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
    if SubheadTag:
        print SubheadTag.get_text(strip=True)

get_FamSearch()

This is the results, without able to locate and extract the subheading:

Helen Brad
[Finished in 2.2s]

Solution

  • The page you are getting via urllib2 doesn't contain the div with subhead class. The actual heading is constructed asynchronously with the help of javascript being executed on the browser-side.

    The data you need is presented differently, here's what works for me:

    print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()
    

    Prints:

    Canada Census, 1901