I am working on a Supply Chain Management college project and want to analyze daily postings on a website to document the industry's demand for services and products. The particular page changes every day, with a different number of containers and pages:
My code generates a CSV file (ignore the headers for now) by scraping HTML tags and recording the data points. I tried to use a 'for' loop, but the code still scans only the first page.
Python knowledge level: beginner, learned the 'hard way' through YouTube and googling. I found an example that worked for my level of understanding, but I have trouble combining different people's solutions.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(1, 3):
    my_url = 'https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "rc"})

    filename = "BuyandSell.csv"
    f = open(filename, "w")
    headers = "Title, Publication Date, Closing Date, GSIN, Notice Type, Procurement Entity\n"
    f.write(headers)

    for container in containers:
        Title = container.h2.text

        publication_container = container.findAll("dd", {"class": "data publication-date"})
        Publication_date = publication_container[0].text

        closing_container = container.findAll("dd", {"class": "data date-closing"})
        Closing_date = closing_container[0].text

        gsin_container = container.findAll("li", {"class": "first"})
        Gsin = gsin_container[0].text

        notice_container = container.findAll("dd", {"class": "data php"})
        Notice_type = notice_container[0].text

        entity_container = container.findAll("dd", {"class": "data procurement-entity"})
        Entity = entity_container[0].text

        print("Title: " + Title)
        print("Publication_date: " + Publication_date)
        print("Closing_date: " + Closing_date)
        print("Gsin: " + Gsin)
        print("Notice: " + Notice_type)
        print("Entity: " + Entity)

        f.write(Title + "," + Publication_date + "," + Closing_date + "," + Gsin + "," + Notice_type + "," + Entity + "\n")

    f.close()
Actual results:
The code generates a CSV file only for the first page.
The code does not append to what was already scraped (from day to day, at least).
Expected results:
The code scans the following pages and recognizes when there are no more pages to go through.
The CSV file would get 10 lines per page (and however many rows are on the last page, since the number is not always 10).
The code would append to what was already scraped (for more advanced analytics using Excel tools with historic data).
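A minimal sketch of those last two points (stop when a page has no result containers, and append to the existing CSV instead of overwriting it), keeping the urllib/BeautifulSoup style used above; the 0-based page query parameter is the same one the answer below uses, and csv.writer also takes care of quoting if a field ever contains a comma:

import os
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

filename = "BuyandSell.csv"
write_header = not os.path.isfile(filename)  # only write the header for a brand-new file

with open(filename, "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    if write_header:
        writer.writerow(["Title", "Publication Date", "Closing Date",
                         "GSIN", "Notice Type", "Procurement Entity"])
    page = 0
    while True:
        url = ("https://buyandsell.gc.ca/procurement-data/search/site?page=" + str(page) +
               "&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice"
               "&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today")
        page_soup = BeautifulSoup(urlopen(url).read(), "html.parser")
        containers = page_soup.findAll("div", {"class": "rc"})
        if not containers:  # no results on this page, so there are no more pages
            break
        for container in containers:
            # pull the other fields the same way as in the code above
            writer.writerow([container.h2.text.strip()])
        page += 1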
Some might say using pandas is overkill, but I'm personally comfortable with it and just like using it to create tables and write to file.
There is also probably a more robust way to go from page to page (see the sketch after the code for one idea), but I just wanted to get this to you so you can work with it.
As of now, I just hard-code the next page value (and I arbitrarily picked 20 pages as a max), so it starts with page 1 and then goes through up to 20 pages (or stops once it hits a page with no results).
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

filename = "BuyandSell.csv"

# Initialize an empty 'results' dataframe
results = pd.DataFrame()

# Iterate through the pages
for page in range(0, 20):
    url = 'https://buyandsell.gc.ca/procurement-data/search/site?page=' + str(page) + '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "rc"})

    # Get data from each container
    if containers != []:
        for each in containers:
            title = each.find('h2').text.strip()
            publication_date = each.find('dd', {'class': 'data publication-date'}).text.strip()
            closing_date = each.find('dd', {'class': 'data date-closing'}).text.strip()
            gsin = each.find('dd', {'class': 'data gsin'}).text.strip()
            notice_type = each.find('dd', {'class': 'data php'}).text.strip()
            procurement_entity = each.find('dd', {'class': 'data procurement-entity'}).text.strip()

            # Create a 1-row dataframe
            temp_df = pd.DataFrame([[title, publication_date, closing_date, gsin, notice_type, procurement_entity]],
                                   columns=['Title', 'Publication Date', 'Closing Date', 'GSIN', 'Notice Type', 'Procurement Entity'])

            # Append that row to the 'results' dataframe
            # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
            results = pd.concat([results, temp_df], ignore_index=True)

        print('Acquired page ' + str(page + 1))
    else:
        print('No more pages')
        break

# If there is already a file saved
if os.path.isfile(filename):
    # Read in the previously saved file
    df = pd.read_csv(filename)
    # Append the newest results
    df = pd.concat([df, results], ignore_index=True)
    # Drop any duplicates (in case the newest results aren't really new)
    df = df.drop_duplicates()
    # Save the previous file, with the appended results
    df.to_csv(filename, index=False)
else:
    # If a previous file is not already saved, save a new one
    df = results.copy()
    df.to_csv(filename, index=False)
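On the "more robust way to go page to page" point: instead of hard-coding a maximum of 20 pages, you could follow the pager's own "next" link until there isn't one. The li.pager-next selector below is an assumption about the pager markup (it looks like a Drupal-style pager), so inspect the actual HTML and adjust it if needed:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://buyandsell.gc.ca'
url = (base + '/procurement-data/search/site'
       '?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice'
       '&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today')

while url:
    page_soup = BeautifulSoup(requests.get(url).text, "html.parser")
    containers = page_soup.findAll("div", {"class": "rc"})
    # ... scrape the containers exactly as above ...

    next_link = page_soup.select_one('li.pager-next a')  # assumed selector, verify on the real page
    url = urljoin(base, next_link['href']) if next_link else None

Either way, the drop_duplicates() step keeps the historic CSV from filling up with repeated rows when the same notices show up on consecutive days.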