I'm a Python beginner, and I'm hoping that what I'm trying to do isn't too involved. Essentially, I want to extract the text of the minutes (contained in PDF documents) from this municipality's council meetings for the last ~10 years at this website: https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3
Eventually, I want to analyze/categorise the action items from the meeting minutes. All I've been able to do so far is grab the links leading to the PDFs from the first page. Here is my code:
# Import requests for navigating to websites, beautiful soup to scrape website, PyPDF2 for PDF data mining
import sys
import requests
import bs4
import PyPDF2
#import PDfMiner
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# Soupify URL
my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
result = requests.get(my_url)
src = result.content
page_soup = soup(src, "lxml")
#list with links
urls = []
for tr_tag in page_soup.find_all("tr"):
    a_tag = tr_tag.find("a")
    urls.append(a_tag.attrs["href"])
print(urls)
A few things I could use help with:
Any help is so appreciated! Thank you in advance!
EDIT: I am hoping to get the data into a dataframe, where the first column is the file name and the second column is the text from the PDF. It would look like:
PDF_file_name | PDF_text |
---|---|
spec20210729min | [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw |
spec20210802min | [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw |
Welcome to the exciting world of web scraping!
First of all, great job: you were on the right track. There are a few points to discuss, though.
You essentially have 2 problems here.
1 - How to retrieve the HTML text for all pages (1, ..., 50)?
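In web scraping there are mainly two kinds of web pages: pages whose content is already in the HTML the server returns, where you can simply use requests to get the page content, and pages whose content is generated by JavaScript in the browser.
To get all the pages from 1 to 50, we need to somehow click on the Next button at the end of the page.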
Why?
If you check what happens in the Network tab of the browser's developer console, you will see that each click on the Next button triggers a new request, which fetches a JS script that generates the page.
Unfortunately, we can't render JavaScript using requests.
But there is a solution: headless browsers.
In the solution, I use selenium, which is a library that can use a real browser driver (in our case Chrome) to query a page and render JavaScript.
So we first get the web page with selenium and extract the HTML, then we click on Next, wait a bit for the page to load, extract the HTML again, and so on.
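A fixed pause can be too short on a slow connection and wasteful on a fast one. One possible alternative, sketched here under assumptions, is to poll until the page source actually changes after the click. The helper name wait_for_change and the FakeDriver class are hypothetical and exist only so the snippet runs without a browser; with selenium you would pass lambda: driver.page_source instead. Selenium also ships its own WebDriverWait class for this kind of waiting.

```python
import time
from typing import Callable


def wait_for_change(get_value: Callable[[], str], old_value: str,
                    timeout: float = 10.0, poll: float = 0.2) -> str:
    """Poll get_value() until it differs from old_value, then return the new value.

    Raises TimeoutError if nothing changes within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = get_value()
        if value != old_value:
            return value
        time.sleep(poll)
    raise TimeoutError("page did not change before the timeout")


class FakeDriver:
    """Hypothetical stand-in for a browser driver: the page updates two polls after a click."""

    def __init__(self) -> None:
        self._polls_left = 0
        self._page = 1

    def click_next(self) -> None:
        # Simulate a page that needs two polls before the new content shows up
        self._polls_left = 2

    @property
    def page_source(self) -> str:
        if self._polls_left > 0:
            self._polls_left -= 1
            if self._polls_left == 0:
                self._page += 1
        return f"<html>page {self._page}</html>"


driver = FakeDriver()
before = driver.page_source
driver.click_next()
after = wait_for_change(lambda: driver.page_source, before, timeout=2.0, poll=0.01)
print(after)
```

Comparing the whole page source is a blunt check: a real page may change in several steps while loading, so waiting for a specific element is usually more reliable.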
2 - How to extract the text from the PDFs after getting them?
After downloading the PDFs, we can load each one into a variable, open it with PyPDF2, and extract the text from all of its pages. The details are in the solution code.
Here is a working solution. It iterates over the first n pages you want and returns the text from all the PDFs you are interested in:
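Since the target dataframe identifies each row by file name, it may help to derive that name from the PDF URL itself rather than from the page that links to it. The following is a minimal sketch using only the standard library; the function name pdf_name_from_url and the example URL are hypothetical and only illustrate the expected shape.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def pdf_name_from_url(url: str) -> str:
    """Return the file name without its extension, ignoring any query string."""
    path = urlsplit(url).path
    return PurePosixPath(path).stem


# Hypothetical URL, used only to illustrate the expected shape
example = "https://example.org/20210729/documents/spec20210729min.pdf?download=1"
print(pdf_name_from_url(example))  # spec20210729min
```

Parsing the URL first keeps a query string or fragment from leaking into the name, which a plain split on "/" and "." would not guarantee.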
import time
from io import BytesIO
from urllib.parse import urljoin

import pandas as pd
import PyPDF2
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Create a headless Chrome driver to query and act on web pages like a browser
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Main URL
my_url = (
    "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
)


def get_n_first_pages(n: int):
    """Get the HTML text for the first n pages.

    Args:
        n (int): The number of pages we want

    Returns:
        List[str]: A list of HTML text
    """
    # Initialize the list containing the pages
    pages = []
    # We query the web page with our Chrome driver.
    # This way we can iteratively click on the Next link to get all the pages we want
    driver.get(my_url)
    # We append the page source code
    pages.append(driver.page_source)
    # Then for all subsequent pages, we click on Next and wait to get the page
    for _ in range(1, n):
        # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
        # which is no longer available in recent Selenium 4 releases
        driver.find_element(
            By.CSS_SELECTOR,
            "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28",
        ).click()
        # Wait for the page to load
        time.sleep(1)
        # Append the page
        pages.append(driver.page_source)
    return pages


def get_pdf(link: str):
    """Get the PDF text, per PDF page, for a given link.

    Args:
        link (str): The link where we can retrieve the PDF

    Returns:
        Tuple[str, List[str]]: The file name and a list containing one string per PDF page
    """
    # We extract the file name
    pdf_name = link.split("/")[-1].split(".")[0]
    # We get the page containing the PDF link.
    # Here we don't need the Chrome driver since we don't have to click on the link:
    # we can just get the PDF using requests after finding the href
    pdf_link_page = requests.get(link)
    page_soup = soup(pdf_link_page.text, "lxml")
    # We get all <a> tags that have an href attribute, then we keep only the hrefs
    # containing min.pdf, since we only want the PDF for the minutes
    pdf_links = [
        urljoin(link, a.attrs["href"])
        for a in page_soup.find_all("a", {"href": True})
        if "min.pdf" in a.attrs["href"]
    ]
    # There is only one PDF for the minutes, so we take the only element in the list
    pdf_link = pdf_links[0]
    # We get the PDF with requests and then get the PDF bytes
    pdf_bytes = requests.get(pdf_link).content
    # We load the bytes into an in-memory file (to avoid saving the PDF on disk)
    p = BytesIO(pdf_bytes)
    # Now we can load our PDF in PyPDF2 from memory.
    # PdfReader, .pages and extract_text() are the current names; the older
    # PdfFileReader, getPage() and extractText() were removed in PyPDF2 3.0
    read_pdf = PyPDF2.PdfReader(p)
    pages_txt = []
    # For each page we extract the text
    for page in read_pdf.pages:
        pages_txt.append(page.extract_text())
    # We return the PDF name as well as the text inside each page
    return pdf_name, pages_txt


# Get the first 2 pages, you can change this number
pages = get_n_first_pages(2)
# The browser is no longer needed once the HTML has been collected
driver.quit()

# Initialize a list to store each dataframe row
df_rows = []
# We iterate over each page
for page in pages:
    page_soup = soup(page, "lxml")
    # Here we get only the <a> tags inside the tbody and each tr.
    # We avoid getting the links from the head of the table
    all_links = page_soup.select("tbody tr a")
    # We extract the href for only the links containing council (we don't care about the
    # video link)
    minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    # Download each minutes PDF and store one row per file
    for link in minutes_links:
        pdf_name, pages_text = get_pdf(link)
        df_rows.append(
            {
                "PDF_file_name": pdf_name,
                # We join each page in the list into one string, separating them with a newline
                "PDF_text": "\n".join(pages_text),
            }
        )
        # Stops after the first link for a quick test; remove to process every link
        break
    # Stops after the first page for a quick test; remove to process every page
    break

# We create the dataframe from the list of rows
df = pd.DataFrame(df_rows)
Outputs a dataframe like:
PDF_file_name PDF_text
0 spec20210729ag \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING...
...
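The extracted text is full of stray spaces and newlines, which gets in the way of later analysis. As a rough sketch (the function name clean_pdf_text is hypothetical), the whitespace can be normalized with a regular expression before the text goes into the dataframe:

```python
import re


def clean_pdf_text(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Sample taken from the kind of output shown in the question
sample = " \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n"
print(clean_pdf_text(sample))  # SPECIAL COUNCIL MEET ING MINUTES JULY 29, 2021
```

With pandas this could be applied as df["PDF_text"] = df["PDF_text"].map(clean_pdf_text). Note that it cannot rejoin words that the PDF extraction split across lines, such as "MEET ING"; that would need a dictionary-based or layout-aware approach.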
Keep scraping the web, it's fun :)