python html web-scraping beautifulsoup urlopen

How to identify and follow a link, then print data from a new webpage with BeautifulSoup

I am trying to (1) grab a title from a webpage, (2) print the title, (3) follow a link to the next page, (4) grab the title from the next page, and (5) print the title from the next page.

Steps (1) and (4) use the same function, as do steps (2) and (5). The only difference is that steps (4) and (5) run on the next page.

#Imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


##Internet
#Link to webpage 
web_page = urlopen("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22")
#Soup object
soup = BeautifulSoup(web_page, 'html.parser')

I am not having any problems with steps 1 and 2. My code is able to get the title and print it effectively. Steps 1 and 2:

##Get Data
def get_title():
    #Patent Number
    Patent_Number = soup.title.text
    print(Patent_Number)

get_title()

The output I am getting is exactly what I want:

#Print Out
United States Patent: 10530579

I am having trouble with step (3). I have been able to identify the right link, but not follow it to the next page. The link I want is the 'href' of the anchor tag that wraps the image tag.

Picture of link to follow.

The following code is my working draft for steps 3, 4, and 5:

#Get
def get_link():
    ##Internet
    #Link to webpage 
    html = urlopen("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22")
    #Soup object
    soup = BeautifulSoup(html, 'html.parser')
    #Find image
    ##image = <img valign="MIDDLE" src="/netaicon/PTO/nextdoc.gif" border="0" alt="[NEXT_DOC]">
    #image = soup.find("img", valign="MIDDLE")
    image = soup.find("img", valign="MIDDLE", alt="[NEXT_DOC]")
    #The <a> tag that wraps the image holds the href
    link = image.parent
    #Get new link
    new_link = link.attrs['href']
    print(new_link)

get_link()

The output I am getting:

#Print Out
##/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=32&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"

The output is the exact link I want to follow. In short, the function I am trying to write will open the new_link variable as a new webpage, and perform the same functions performed in (1) and (2) on the new webpage. The resulting output will be two titles instead of one (one for the webpage and one for the new webpage).

In essence, I need to write a:

urlopen(new_link)

function, instead of a:

print(new_link)

function. Then, perform steps 4 and 5 on the new webpage. However, I am having trouble figuring out how to open the new page and grab the title. One problem is that new_link is not a full URL; it is a relative link that I want to follow.

Solution

I took the opportunity to clean up your code: I removed the unused import of re and simplified your functions. The href you extract is a relative path, so prepend the site's base URL before passing it to urlopen:

from urllib.request import urlopen
from bs4 import BeautifulSoup


def get_soup(url):
    # Download the page and parse it; the response is closed afterwards
    with urlopen(url) as response:
        return BeautifulSoup(response, 'html.parser')

def get_title(soup):
    return soup.title.text  # Patent Number

def get_next_link(soup):
    return soup.find("img", valign="MIDDLE", alt="[NEXT_DOC]").parent['href']

base_url = 'http://patft.uspto.gov'
web_page = base_url + '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22'

soup = get_soup(web_page)

get_title(soup)
> 'United States Patent: 10530579'

get_next_link(soup)
> '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=32&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"'

soup = get_soup(base_url + get_next_link(soup))
get_title(soup)
> 'United States Patent: 10529534'

get_next_link(soup)
> '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=33&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"'
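
If you want to follow more than one link, the same steps can be wrapped in a loop. The following is only a sketch under a few assumptions: the names fetch_html, next_link, and collect_titles are hypothetical helpers, not part of your code or of BeautifulSoup, and the page is assumed to wrap the [NEXT_DOC] image in an anchor tag as in your example. It uses urllib.parse.urljoin from the standard library instead of string concatenation, so the relative href is resolved against the page it came from, and it stops when no next link is found.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def next_link(soup):
    """Return the href of the <a> wrapping the [NEXT_DOC] image, or None."""
    image = soup.find("img", alt="[NEXT_DOC]")
    if image is None:
        return None
    anchor = image.find_parent("a")
    if anchor is None:
        return None
    return anchor.get("href")


def collect_titles(start_url, max_pages=2, fetch=fetch_html):
    """Follow [NEXT_DOC] links from start_url and return up to max_pages titles.

    `fetch` is any callable that maps a URL to HTML; it is a parameter so the
    loop can be tried against canned pages without touching the network.
    """
    titles = []
    url = start_url
    while url is not None and len(titles) < max_pages:
        soup = BeautifulSoup(fetch(url), "html.parser")
        titles.append(soup.title.get_text(strip=True) if soup.title else "")
        href = next_link(soup)
        # Resolve the relative href against the current page's URL
        url = urljoin(url, href) if href else None
    return titles
```

Calling collect_titles(web_page, max_pages=2) with the URL from above should return the same two titles as the step-by-step version, assuming the site responds as it did there.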