python-2.7 web-scraping beautifulsoup urllib2 web-mining

Why can't I extract the subheading of a page using BeautifulSoup?

I am trying to extract the name and subheading of this page (for example). I have no problem extracting the name, but it's unsuccessful for the subheading. Using inspect element in Chrome, I identified that the subheading text "Canada Census, 1901" is embedded as follows:

<div class="person-info">
    <div class="title ng-binding">Helen Brad in household of Geo Wilcock</div>
    <div class="subhead ng-scope ng-binding" data-ng-if="!recordPersonCentric">Canada Census, 1901</div>

So I coded my script as follows:

import urllib2
import re
import csv
from bs4 import BeautifulSoup
import time

def get_FamSearch():

    link = "https://example.org/pal:/MM9.1.1/KH11-999"
    openLink = urllib2.urlopen(link)
    Soup_FamSearch = BeautifulSoup(openLink, "html")
    openLink.close()

    NameParentTag = Soup_FamSearch.find("tr", class_="result-item highlight-person")
    if NameParentTag:
        Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
        name_decode = Name.encode("ascii", "ignore")
        print name_decode

    SubheadTag = Soup_FamSearch.find("div", class_="subhead ng-scope ng-binding")
    if SubheadTag:
        print SubheadTag.get_text(strip=True)

get_FamSearch()

This is the results, without able to locate and extract the subheading:

Helen Brad
[Finished in 2.2s]

Solution

The page you are getting via urllib2 doesn't contain the div with subhead class. The actual heading is constructed asynchronously with the help of javascript being executed on the browser-side.

The data you need is presented differently, here's what works for me:

print Soup_FamSearch.find('dt', text='Title').find_next_sibling('dd').text.strip()

Prints:

Canada Census, 1901