Tags: python-3.x, web-scraping, beautifulsoup, pagination, python-requests-html

How to make a pagination loop that scrapes a specific number of pages (the number of pages varies from day to day)


Summary

I am working on a Supply Chain Management college project and want to analyze daily postings on a website to document the industry's demand for services/products. The particular page changes every day, with a different number of containers and pages:

https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today

Background

The code generates a CSV file (do not mind the headers) by scraping HTML tags and documenting the data points. I tried to use a 'for' loop, but the code still scans only the first page.

Python knowledge level: beginner, learned the 'hard way' through YouTube and googling. I found an example that worked for my level of understanding, but I have trouble combining people's different solutions.

Code at the moment

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

The problem starts here:

for page in range(1, 3):
    my_url = 'https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"rc"})

This part does not append to the existing line items (a possible fix is sketched after the code):

filename = "BuyandSell.csv"
f = open(filename, "w")
headers = "Title, Publication Date, Closing Date, GSIN, Notice Type, Procurement Entity\n"
f.write(headers)

for container in containers:
    Title = container.h2.text

    publication_container = container.findAll("dd",{"class":"data publication-date"})
    Publication_date = publication_container[0].text

    closing_container = container.findAll("dd",{"class":"data date-closing"})
    Closing_date = closing_container[0].text

    gsin_container = container.findAll("li",{"class":"first"})
    Gsin = gsin_container[0].text

    notice_container = container.findAll("dd",{"class":"data php"})
    Notice_type = notice_container[0].text

    entity_container = container.findAll("dd",{"class":"data procurement-entity"})
    Entity = entity_container[0].text

    print("Title: " + Title)
    print("Publication_date: " + Publication_date)
    print("Closing_date: " + Closing_date)
    print("Gsin: " + Gsin)
    print("Notice: " + Notice_type)
    print("Entity: " + Entity)

    f.write(Title + "," +Publication_date + "," +Closing_date + "," +Gsin + "," +Notice_type + "," +Entity +"\n")

f.close()
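
A minimal sketch of one way to fix the appending issue noted above, assuming you stay with the csv module rather than switching to pandas: open the file in append mode ('a') and write the header row only when the file does not already exist, so each daily run adds rows instead of overwriting them. The filename and column names are taken from the code above.

    import csv
    import os

    filename = "BuyandSell.csv"
    file_exists = os.path.isfile(filename)

    # Append mode keeps rows from earlier runs instead of overwriting them
    with open(filename, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if not file_exists:
            writer.writerow(["Title", "Publication Date", "Closing Date",
                             "GSIN", "Notice Type", "Procurement Entity"])
        # Inside the container loop, each notice would then be written with:
        # writer.writerow([Title, Publication_date, Closing_date, Gsin, Notice_type, Entity])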

Please let me know if you would like to see more. The rest defines the data containers that are found in the HTML and written to the CSV. Any help/advice would be highly appreciated. Thanks!

Actual Results :

The code generates a CSV file for the first page only.

The code does not append to what was already scraped (from day to day).

Expected Results :

The code scans the next pages and recognizes when there are no more pages to go through (see the sketch after this list).

The CSV file would gain roughly 10 lines per page (and however many are on the last page, as the number is not always 10).

The code would append to what was already scraped (for more advanced analytics using Excel tools with historic data).
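
Before the full solution, here is a minimal sketch of how the original urllib-based loop could walk through the pages, assuming the site paginates via a page= query parameter (the same parameter the solution below relies on) and that an empty list of result containers means there are no more pages:

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    base_url = ('https://buyandsell.gc.ca/procurement-data/search/site?page={}'
                '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice'
                '&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today')

    page = 0
    while True:
        uClient = uReq(base_url.format(page))
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()

        containers = page_soup.findAll("div", {"class": "rc"})
        if not containers:
            # No result containers on this page: there are no more pages
            break

        # ... parse each container and write its row here, as in the code above ...
        page += 1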


Solution

  • Some might say using pandas is overkill, but personally I'm comfortable with it and just like using it to create tables and write to file.

    There is probably a more robust way to go from page to page, but I just wanted to get this to you so you can work with it.

    As of now, I just hard-code the next page value (and I arbitrarily picked 20 pages as a maximum). So it starts with page 1 and then goes through up to 20 pages (or stops once it reaches a page with no results).

    import pandas as pd
    from bs4 import BeautifulSoup
    import requests
    import os
    
    filename = "BuyandSell.csv"
    
    # Initialize an empty 'results' dataframe
    results = pd.DataFrame()
    
    # Iterate through the pages
    for page in range(0,20):
        url = 'https://buyandsell.gc.ca/procurement-data/search/site?page=' + str(page) + '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'
    
        page_html = requests.get(url).text
        page_soup = BeautifulSoup(page_html, "html.parser")
        containers = page_soup.findAll("div",{"class":"rc"})
    
        # Get data from each container
        if containers:
            for each in containers:
                title = each.find('h2').text.strip()
                publication_date = each.find('dd', {'class':'data publication-date'}).text.strip()
                closing_date = each.find('dd', {'class':'data date-closing'}).text.strip()
                gsin = each.find('dd', {'class':'data gsin'}).text.strip()
                notice_type = each.find('dd', {'class':'data php'}).text.strip()
                procurement_entity = each.find('dd', {'class':'data procurement-entity'}).text.strip()
    
                # Create 1 row dataframe
                temp_df = pd.DataFrame([[title, publication_date, closing_date, gsin, notice_type, procurement_entity]], columns = ['Title', 'Publication Date', 'Closing Date', 'GSIN', 'Notice Type', 'Procurement Entity'])
    
                # Append that row to a 'results' dataframe
                results = pd.concat([results, temp_df], ignore_index=True)
            print('Acquired page ' + str(page+1))
    
        else:
            print ('No more pages')
            break
    
    
    # If already have a file saved
    if os.path.isfile(filename):
    
        # Read in previously saved file
        df = pd.read_csv(filename)
    
        # Append the newest results
        df = pd.concat([df, results], ignore_index=True)
    
        # Drop any duplicates (in case the newest results aren't really new)
        df = df.drop_duplicates()
    
        # Save the previous file, with appended results
        df.to_csv(filename, index=False)
    
    else:
    
        # If a previous file not already saved, save a new one
        df = results.copy()
        df.to_csv(filename, index=False)
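
    As a follow-up on the design: appending one-row dataframes inside the loop works, but it re-copies the accumulated data on every iteration. A slightly leaner sketch of the same idea, assuming the same URL and CSS classes as above, collects plain dicts in a list and builds the dataframe once at the end:

    import pandas as pd
    from bs4 import BeautifulSoup
    import requests

    rows = []
    for page in range(0, 20):
        url = ('https://buyandsell.gc.ca/procurement-data/search/site?page=' + str(page) +
               '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice'
               '&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today')
        page_soup = BeautifulSoup(requests.get(url).text, "html.parser")
        containers = page_soup.findAll("div", {"class": "rc"})
        if not containers:
            break

        for each in containers:
            # One dict per notice; keys match the CSV columns above
            rows.append({
                'Title': each.find('h2').text.strip(),
                'Publication Date': each.find('dd', {'class': 'data publication-date'}).text.strip(),
                'Closing Date': each.find('dd', {'class': 'data date-closing'}).text.strip(),
                'GSIN': each.find('dd', {'class': 'data gsin'}).text.strip(),
                'Notice Type': each.find('dd', {'class': 'data php'}).text.strip(),
                'Procurement Entity': each.find('dd', {'class': 'data procurement-entity'}).text.strip(),
            })

    # Build the dataframe once; the append/deduplicate/save logic above stays the same
    results = pd.DataFrame(rows)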