Tags: python, web-scraping, beautifulsoup, python-requests, pypdf

Scraping PDFs from multiple pages using bs4


I'm a python beginner and I'm hoping that what I'm trying to do isn't too involved. Essentially, I want to extract the text of the minutes (contained in PDF documents) from this municipality's council meetings for the last ~10 years at this website: https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3

Eventually, I want to analyze/categorise the action items from the meeting minutes. All I've been able to do so far is grab the links leading to the PDFs from the first page. Here is my code:

# Import requests for navigating to websites, beautiful soup to scrape website, PyPDF2 for PDF data mining
 
import sys 
import requests
import bs4 
import PyPDF2 
#import PDfMiner 
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup 

# Soupify URL
my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
result = requests.get(my_url)
src = result.content
page_soup = soup(src, "lxml")

#list with links
urls = []
for tr_tag in page_soup.find_all("tr"):
    a_tag = tr_tag.find("a")
    urls.append(a_tag.attrs["href"])

print(urls)

A few things I could use help with:

  • How do I pull the links from pages 1-50 (an arbitrary number) of the 'Previous Meetings' site, instead of just the first page?
  • How do I go about entering each of the links and pulling the 'Read the minutes' PDFs for text analysis (using PyPDF2)?

Any help is so appreciated! Thank you in advance!

EDIT: I am hoping to get the data into a dataframe, where the first column is the file name and the second column is the text from the PDF. It would look like:

PDF_file_name PDF_text
spec20210729min [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nJULY 29, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
spec20210802min [[' \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING MINUTES\n \n \nAUGUST 2, 2021\n \n \nA Special Meeting of the Council\n \nof the City of Vancouver\n \nw
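
In other words, something along these lines (placeholder text only, to show the shape I'm after):

import pandas as pd

# Placeholder rows to illustrate the desired two-column shape:
# one row per PDF, file name first, extracted text second.
rows = [
    {"PDF_file_name": "spec20210729min", "PDF_text": "SPECIAL COUNCIL MEETING MINUTES ..."},
    {"PDF_file_name": "spec20210802min", "PDF_text": "SPECIAL COUNCIL MEETING MINUTES ..."},
]
df = pd.DataFrame(rows)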

Solution

  • Welcome to the exciting world of web scraping!

    First of all, great job, you were on the right track. There are a few points to discuss, though.

    You essentially have 2 problems here.

    1 - How to retrieve the HTML text for all pages (1, ..., 50)?

    In web scraping, you mainly deal with two kinds of web pages:

    1. If you are lucky, the page does not render using JavaScript, and you can get the page content with requests alone
    2. If you are less lucky, the page uses JavaScript to render partly or entirely

    To get all the pages from 1 to 50, we need to somehow click on the Next button at the end of the page. Why? If you check what happens in the Network tab of the browser developer console, you will see that every click on the Next button fetches a JS script that generates the new page. Unfortunately, we can't render JavaScript using requests.
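
    To see this for yourself, here is a quick sanity check (a sketch, assuming the meeting rows sit inside tbody tr, as the full solution below also assumes): the rows present in the raw HTML correspond to the first page only, however often you request the URL.

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page without any JavaScript execution and count the table rows.
    # Only the first page of results is present in the raw HTML; the later
    # pages are generated client-side when the Next button is clicked.
    my_url = "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
    html = requests.get(my_url).text
    rows = BeautifulSoup(html, "lxml").select("tbody tr")
    print(f"Rows visible without JavaScript: {len(rows)}")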

    But we have a solution: Headless Browsers (wiki).

    In the solution, I use selenium, which is a library that can use a real browser driver (in our case Chrome) to query a page and render JavaScript.

    So we first load the page with selenium and extract the HTML, then we click on Next, wait a bit for the page to load, extract the HTML again, and so on.
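
    As a side note, if the fixed one-second sleep used in the solution below ever proves flaky on a slow connection, an explicit wait is a more robust alternative. Here is a sketch (it reuses the driver and the CSS selectors from the full solution, and waits for an old row to go stale as the signal that the next page has rendered):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Grab a row from the current page, click Next, then wait until that row
    # goes stale, i.e. the table has been re-rendered with the next page.
    old_row = driver.find_element(By.CSS_SELECTOR, "tbody tr")
    driver.find_element(
        By.CSS_SELECTOR,
        "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
    ).click()
    WebDriverWait(driver, 10).until(EC.staleness_of(old_row))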

    2 - How to extract the text from the PDFs after getting them?

    After downloading a PDF, we can load it into a variable, open it with PyPDF2, and extract the text from all its pages. I will let you look at the solution code.
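
    As a version note, the PDF part of the solution below uses the classic PyPDF2 interface (PdfFileReader, numPages, getPage, extractText). If you are on the newer pypdf package instead (one of the question's tags), the same extraction would look roughly like this, assuming pdf_bytes holds the downloaded PDF content as in get_pdf below:

    from io import BytesIO

    from pypdf import PdfReader

    # Same idea with the renamed pypdf API: load the downloaded bytes from
    # memory and pull the text out of every page.
    read_pdf = PdfReader(BytesIO(pdf_bytes))
    pages_txt = [page.extract_text() for page in read_pdf.pages]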

    Here is a working solution. It will iterate over the first n pages you want and return the text from all the PDFs you are interested in:

    import os
    import time
    from io import BytesIO
    from urllib.parse import urljoin
    
    import pandas as pd
    import PyPDF2
    import requests
    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    # Create a headless chromedriver to query and perform action on webpages like a browser
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    
    # Main url
    my_url = (
        "https://covapp.vancouver.ca/councilMeetingPublic/CouncilMeetings.aspx?SearchType=3"
    )
    
    
    def get_n_first_pages(n: int):
        """Get the html text for the first n pages
    
        Args:
            n (int): The number of pages we want
    
        Returns:
            List[str]: A list of html text
        """
    
        # Initialize the variables containing the pages
        pages = []
    
        # We query the web page with our chrome driver.
        # This way we can iteratively click on the next link to get all the pages we want
        driver.get(my_url)
        # We append the page source code
        pages.append(driver.page_source)
    
        # Then for all subsequent pages, we click on next and wait to get the page
        for _ in range(1, n):
            driver.find_element_by_css_selector(
                "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
            ).click()
            # Wait for the page to load
            time.sleep(1)
            # Append the page
            pages.append(driver.page_source)
        return pages
    
    
    def get_pdf(link: str):
        """Get the pdf text, per PDF pages, for a given link.
    
        Args:
            link (str): The link where we can retrieve the PDF
    
        Returns:
            List[str]: A list containing a string per PDF pages
        """
    
        # We extract the file name
        pdf_name = link.split("/")[-1].split(".")[0]
    
        # We get the page containing the PDF link
        # Here we don't need the chrome driver since we don't have to click on the link
        # We can just get the PDF using requests after finding the href
        pdf_link_page = requests.get(link)
        page_soup = soup(pdf_link_page.text, "lxml")
        # We get all <a> tag that have href attribute, then we select only the href
        # containing min.pdf, since we only want the PDF for the minutes
        pdf_link = [
            urljoin(link, l.attrs["href"])
            for l in page_soup.find_all("a", {"href": True})
            if "min.pdf" in l.attrs["href"]
        ]
        # There is only one PDF for the minutes so we get the only element in the list
        pdf_link = pdf_link[0]
    
        # We get the PDF with requests and then get the PDF bytes
        pdf_bytes = requests.get(pdf_link).content
        # We load the bytes into an in memory file (to avoid saving the PDF on disk)
        p = BytesIO(pdf_bytes)
        p.seek(0, os.SEEK_END)
    
        # Now we can load our PDF in PyPDF2 from memory
        read_pdf = PyPDF2.PdfFileReader(p)
        count = read_pdf.numPages
        pages_txt = []
        # For each page we extract the text
        for i in range(count):
            page = read_pdf.getPage(i)
            pages_txt.append(page.extractText())
    
        # We return the PDF name as well as the text inside each pages
        return pdf_name, pages_txt
    
    
    # Get the first 2 pages, you can change this number
    pages = get_n_first_pages(2)
    
    
    # Initialize a list to store each dataframe rows
    df_rows = []
    
    # We iterate over each page
    for page in pages:
        page_soup = soup(page, "lxml")
    
        # Here we get only the <a> tag inside the tbody and each tr
        # We avoid getting the links from the head of the table
        all_links = page_soup.select("tbody tr a")
        # We extract the href for only the links containing council (we don't care about the
        # video link)
        minutes_links = [x.attrs["href"] for x in all_links if "council" in x.attrs["href"]]
    
        # For each minutes link, download the PDF and extract its text
        for link in minutes_links:
            pdf_name, pages_text = get_pdf(link)
    
            df_rows.append(
                {
                    "PDF_file_name": pdf_name,
                    # We join each page in the list into one string, separating them with a line return
                    "PDF_text": "\n".join(pages_text),
                }
            )
    
            # NOTE: these two breaks stop after the first minutes link of the
            # first page, just to keep the demo quick; remove them to process
            # every link on every page
            break
        break
    
    # We create the data frame from the list of rows
    df = pd.DataFrame(df_rows)
    

    Outputs a dataframe like:

            PDF_file_name                                           PDF_text
        0  spec20210729ag   \n \n \n \n \n \n \nSPECIAL COUNCIL MEET\nING...
    ...
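
    One caveat about library versions (an assumption about your environment rather than something the code above guarantees): Selenium 4.3 removed the find_element_by_* helpers, so on a recent install the click inside get_n_first_pages needs the By-based spelling of the same call (also used in the explicit-wait sketch earlier):

    from selenium.webdriver.common.by import By

    # Selenium >= 4.3 spelling of "click the Next button"
    driver.find_element(
        By.CSS_SELECTOR,
        "#LiverpoolTheme_wt93_block_wtMainContent_RichWidgets_wt132_block_wt28"
    ).click()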
    

    Keep scraping the web, it's fun :)