Tags: python, web-scraping, beautifulsoup, python-requests, pypdf

Scraping PDFs from multiple pages using bs4


I'm a python beginner and I'm hoping that what I'm trying to do isn't too involved. Essentially, I want to extract the text of the minutes (contained in PDF documents) from this municipality's council meetings for the last ~10 years at this website: https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3

Eventually, I want to analyze/categorise the action items from the meeting minutes. All I've been able to do so far is grab the links leading to the PDFs from the first page. Here is my code:

# Import requests for navigating to websites, beautiful soup to scrape website, PyPDF2 for PDF data mining
 
import sys 
import requests
import bs4 
import PyPDF2 
#import PDfMiner 
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup 

# Soupify URL
my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
result = requests.get(my_url)
src = result.content
page_soup = soup(src, "lxml")

#list with links
urls = []
for tr_tag in page_soup.find_all("tr"):
    a_tag = tr_tag.find("a")
    urls.append(a_tag.attrs["href"])

print(urls)

A few things I could use help with:

  • How do I pull the links from pages 1-50 (an arbitrary number) of the 'Previous Meetings' site, instead of just the first page?
  • How do I go about entering each of the links and pulling the 'Read the minutes' PDFs for text analysis (using PyPDF2)?

Any help is so appreciated! Thank you in advance!

EDIT: I am hoping to get the data into a dataframe, where the first column is the file name and the second column is the text from the PDF. It would look like:

PDF_file_name PDF_text
spec20210729min [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
spec20210802min [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
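
In other words, something along these lines (placeholder text only, to show the shape I'm after):

import pandas as pd

# Placeholder rows to illustrate the desired two-column shape:
# one row per PDF, file name first, extracted text second.
rows = [
    {"PDF_file_name": "spec20210729min", "PDF_text": "SPECIAL COUNCIL MEETING MINUTES ..."},
    {"PDF_file_name": "spec20210802min", "PDF_text": "SPECIAL COUNCIL MEETING MINUTES ..."},
]
df = pd.DataFrame(rows)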

Solution

  • Welcome to the exciting world of web scraping!

    First of all, great job, you were on the right track. There are a few points to discuss, though.

    You essentially have 2 problems here.

    1 - How to retrieve the HTML text for all pages (1, ..., 50)?

    In web scraping, you mainly deal with two kinds of web pages:

    1. If you are lucky, the page does not render using JavaScript, and you can get the page content with requests alone
    2. If you are less lucky, the page uses JavaScript to render partly or entirely

    To get all the pages from 1 to 50, we need to somehow click on the Next button at the end of the page. Why? If you check what happens in the Network tab of the browser developer console, you will see that every click on the Next button fetches a JS script that generates the new page. Unfortunately, we can't render JavaScript using requests.
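
    To see this for yourself, here is a quick sanity check (a sketch, assuming the meeting rows sit inside tbody tr, as the full solution below also assumes): the rows present in the raw HTML correspond to the first page only, however often you request the URL.

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page without any JavaScript execution and count the table rows.
    # Only the first page of results is present in the raw HTML; the later
    # pages are generated client-side when the Next button is clicked.
    my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
    html = requests.get(my_url).text
    rows = BeautifulSoup(html, "lxml").select("tbody tr")
    print(f"Rows visible without JavaScript: {len(rows)}")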

    But we have a solution: Headless Browsers (wiki).

    In the solution, I use selenium, which is a library that can use a real browser driver (in our case Chrome) to query a page and render JavaScript.

    So we first load the page with selenium and extract the HTML, then we click on Next, wait a bit for the page to load, extract the HTML again, and so on.
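
    As a side note, if the fixed one-second sleep used in the solution below ever proves flaky on a slow connection, an explicit wait is a more robust alternative. Here is a sketch (it reuses the driver and the CSS selectors from the full solution, and waits for an old row to go stale as the signal that the next page has rendered):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Grab a row from the current page, click Next, then wait until that row
    # goes stale, i.e. the table has been re-rendered with the next page.
    old_row = driver.find_element(By.CSS_SELECTOR, "tbody tr")
    driver.find_element(
        By.CSS_SELECTOR,
        "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
    ).click()
    WebDriverWait(driver, 10).until(EC.staleness_of(old_row))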

    2 - How to extract the text from the PDFs after getting them?

    After downloading a PDF, we can load it into a variable, open it with PyPDF2, and extract the text from all its pages. I will let you look at the solution code.
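
    As a version note, the PDF part of the solution below uses the classic PyPDF2 interface (PdfFileReader, numPages, getPage, extractText). If you are on the newer pypdf package instead (one of the question's tags), the same extraction would look roughly like this, assuming pdf_bytes holds the downloaded PDF content as in get_pdf below:

    from io import BytesIO

    from pypdf import PdfReader

    # Same idea with the renamed pypdf API: load the downloaded bytes from
    # memory and pull the text out of every page.
    read_pdf = PdfReader(BytesIO(pdf_bytes))
    pages_txt = [page.extract_text() for page in read_pdf.pages]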

    Here is a working solution. It will iterate over the first n pages you want and return the text from all the PDFs you are interested in:

    import os
    import time
    from io import BytesIO
    from urllib.parse import urljoin
    
    import pandas as pd
    import PyPDF2
    import requests
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    # Create a headless chromedriver to query and perform action on webpages like a browser
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    
    # Main url
    my_url = (
        "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
    )
    
    
    def get_n_first_pages(n: int):
        """Get the html text for the first n pages
    
        Args:
            n (int): The number of pages we want
    
        Returns:
            List[str]: A list of html text
        """
    
        # Initialize the variables containing the pages
        pages = []
    
        # We query the web page with our chrome driver.
        # This way we can iteratively click on the next link to get all the pages we want
        driver.get(my_url)
        # We append the page source code
        pages.append(driver.page_source)
    
        # Then for all subsequent pages, we click on next and wait to get the page
        for _ in range(1, n):
            driver.find_element_by_css_selector(
                "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
            ).click()
            # Wait for the page to load
            time.sleep(1)
            # Append the page
            pages.append(driver.page_source)
        return pages
    
    
    def get_pdf(link: str):
        """Get the pdf text, per PDF pages, for a given link.
    
        Args:
            link (str): The link where we can retrieve the PDF
    
        Returns:
            List[str]: A list containing a string per PDF pages
        """
    
        # We extract the file name
        pdf_name = link.split("/")[-1].split(".")[0]
    
        # We get the page containing the PDF link
        # Here we don't need the chrome driver since we don't have to click on the link
        # We can just get the PDF using requests after finding the href
        pdf_link_page = requests.get(link)
        page_soup = soup(pdf_link_page.text, "lxml")
        # We get all <a> tag that have href attribute, then we select only the href
        # containing min.pdf, since we only want the PDF for the minutes
        pdf_link = [
            urljoin(link, l.attrs["href"])
            for l in page_soup.find_all("a", {"href": True})
            if "min.pdf" in l.attrs["href"]
        ]
        # There is only one PDF for the minutes so we get the only element in the list
        pdf_link = pdf_link[0]
    
        # We get the PDF with requests and then get the PDF bytes
        pdf_bytes = requests.get(pdf_link).content
        # We load the bytes into an in memory file (to avoid saving the PDF on disk)
        p = BytesIO(pdf_bytes)
        p.seek(0, os.SEEK_END)
    
        # Now we can load our PDF in PyPDF2 from memory
        read_pdf = PyPDF2.PdfFileReader(p)
        count = read_pdf.numPages
        pages_txt = []
        # For each page we extract the text
        for i in range(count):
            page = read_pdf.getPage(i)
            pages_txt.append(page.extractText())
    
        # We return the PDF name as well as the text inside each pages
        return pdf_name, pages_txt
    
    
    # Get the first 2 pages, you can change this number
    pages = get_n_first_pages(2)
    
    
    # Initialize a list to store each dataframe rows
    df_rows = []
    
    # We iterate over each page
    for page in pages:
        page_soup = soup(page, "lxml")
    
        # Here we get only the <a> tag inside the tbody and each tr
        # We avoid getting the links from the head of the table
        all_links = page_soup.select("tbody tr a")
        # We extract the href for only the links containing council (we don't care about the
        # video link)
        minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    
        # For each minutes link, download the PDF and extract its text
        for link in minutes_links:
            pdf_name, pages_text = get_pdf(link)
    
            df_rows.append(
                {
                    "PDF_file_name": pdf_name,
                    # We join each page in the list into one string, separating them with a line return
                    "PDF_text": "\n".join(pages_text),
                }
            )
    
            # NOTE: these two breaks stop after the first minutes link of the
            # first page, just to keep the demo quick; remove them to process
            # every link on every page
            break
        break
    
    # We create the data frame from the list of rows
    df = pd.DataFrame(df_rows)
    

    Outputs a dataframe like:

            PDF_file_name                                           PDF_text
        0  spec20210729ag   \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING...
    ...
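
    One caveat about library versions (an assumption about your environment rather than something the code above guarantees): Selenium 4.3 removed the find_element_by_* helpers, so on a recent install the click inside get_n_first_pages needs the By-based spelling of the same call (also used in the explicit-wait sketch earlier):

    from selenium.webdriver.common.by import By

    # Selenium >= 4.3 spelling of "click the Next button"
    driver.find_element(
        By.CSS_SELECTOR,
        "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
    ).click()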
    

    Keep scraping the web, it's fun :)