I am trying to (1) grab a title from a webpage, (2) print the title, (3) follow a link to the next page, (4) grab the title from the next page, and (5) print the title from the next page.
Steps (1) and (4) are the same function and steps (2) and (5) are the same function. The only difference is that steps (4) and (5) are performed on the next page.
#Imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
##Internet
#Link to webpage
web_page = urlopen("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22")
#Soup object
soup = BeautifulSoup(web_page, 'html.parser')
I am not having any problems with steps 1 and 2. My code is able to get the title and print it effectively. Steps 1 and 2:
##Get Data
def get_title():
    #Patent Number
    Patent_Number = soup.title.text
    print(Patent_Number)
get_title()
The output I am getting is exactly what I want:
#Print Out
United States Patent: 10530579
I am having trouble with step 3. For step (3), I have been able to identify the right link, but not follow it to the next page. The link I want is the 'href' of the anchor tag that wraps the image tag.
The following code is my working draft for steps 3, 4, and 5:
#Get
def get_link():
    ##Internet
    #Link to webpage
    html = urlopen("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22")
    #Soup object
    soup = BeautifulSoup(html, 'html.parser')
    #Find image
    ##image = <img valign="MIDDLE" src="/netaicon/PTO/nextdoc.gif" border="0" alt="[NEXT_DOC]">
    #image = soup.find("img", valign="MIDDLE")
    image = soup.find("img", valign="MIDDLE", alt="[NEXT_DOC]")
    #Get the anchor tag that wraps the image
    link = image.parent
    #Get new link
    new_link = link.attrs['href']
    print(new_link)
get_link()
The output I am getting:
#Print Out
##/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=32&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"
The output is the exact link I want to follow. In short, the function I am trying to write will open the new_link variable as a new webpage, and perform the same functions performed in (1) and (2) on the new webpage. The resulting output will be two titles instead of one (one for the webpage and one for the new webpage).
In essence, I need to write a:
urlopen(new_link)
function, instead of a:
print(new_link)
function. Then, perform steps 4 and 5 on the new webpage. However, I am having trouble figuring out how to open the new page and grab the title. One problem is that new_link is not a complete URL, but a relative link I want to follow.
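Took the opportunity to clean up your code. I removed the unnecessary import of re and simplified your functions. The href you extract is a relative path, so prepend the site's base URL before passing it to urlopen: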
from urllib.request import urlopen
from bs4 import BeautifulSoup
def get_soup(web_page):
    web_page = urlopen(web_page)
    return BeautifulSoup(web_page, 'html.parser')
def get_title(soup):
    return soup.title.text # Patent Number
def get_next_link(soup):
    return soup.find("img", valign="MIDDLE", alt="[NEXT_DOC]").parent['href']
base_url = 'http://patft.uspto.gov'
web_page = base_url + '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=31&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/%22deep+learning%22'
soup = get_soup(web_page)
get_title(soup)
> 'United States Patent: 10530579'
get_next_link(soup)
> '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=32&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"'
soup = get_soup(base_url + get_next_link(soup))
get_title(soup)
> 'United States Patent: 10529534'
get_next_link(soup)
> '/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=33&f=G&l=50&co1=AND&d=PTXT&s1=(%22deep+learning%22.CLTX.+or+%22deep+learning%22.DCTX.)&OS=ACLM/"deep+learning"'
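If you want to follow more than one link, the same idea can be wrapped in a loop. The following is only a sketch, not part of the code above: crawl_titles and its load_page parameter are hypothetical names, and load_page is assumed to be any callable that takes a URL and returns a (title, next_href) pair. It uses urljoin from the standard library instead of string concatenation, so a relative href is resolved against the page it was found on.

```python
from urllib.parse import urljoin


def crawl_titles(start_url, load_page, limit=2):
    """Collect titles from up to `limit` consecutive pages.

    load_page is a hypothetical callable: load_page(url) must return a
    (title, next_href) tuple, where next_href may be a relative link,
    or None when the page has no next link.
    """
    titles = []
    url = start_url
    while url is not None and len(titles) < limit:
        title, next_href = load_page(url)
        titles.append(title)
        # urljoin resolves a relative href against the current page URL
        url = urljoin(url, next_href) if next_href else None
    return titles
```

With the functions above, load_page could be a small wrapper that calls get_soup(url) once and returns (get_title(soup), get_next_link(soup)). Calling crawl_titles(web_page, load_page, limit=2) should then give the two titles you are after.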