Anyone have any idea what's wrong with the below code? It grabs the data from one day only. The webpage is however a dynamic web database and contains data for a number of years. I need to scrape the data for each month and day within 2013-2016 and store into a CSV file.
import calendar
import requests
from bs4 import BeautifulSoup
cal = calendar.Calendar()
base_url = 'http://www.pse.pl/index.php?modul=21&id_rap=24&data=2016'
month_url = '&Month='
day_url = '&Day='
for year in range(2015, 2017):
for month in range(1, 13):
monthdays = [d for d in cal.itermonthdays(year, month) if d != 0]
for day in monthdays:
r = requests.get(base_url + str(year) + month_url + str(month) + day_url + str(day))
soup = BeautifulSoup(r.text,'lxml')
findtable = soup.find('table',{'id':'tabela'})
for i in findtable.findAll('tr'):
for j in i.findAll('td'):
print (j.text)
The main thing I noticed is that the date must be formatted differently in the url that you send to the site. Hyphens are required between elements of the date (year, month and day). I can also suggest an easier way of moving through the days.
Arrows is a module for working with dates. In this case I use it for generating a range of days from 1 January 2015 through 3 January 2015 inclusive.
I also use it for formatting these dates.
I find the tables using BeautifulSoup and then use pandas for extracting the contents of the tables into a dataframe. Finally, I write those dataframes into csv files.
>>> import requests
>>> import arrow
>>> import bs4
>>> from datetime import datetime
>>> import pandas as pd
>>> start = arrow.get(datetime(2015,1,1))
>>> end = arrow.get(datetime(2015,1,3))
>>> base_url = 'http://www.pse.pl/index.php?modul=21&id_rap=24&data='
>>> for day in arrow.Arrow.range('day', start, end):
... page = requests.get(base_url+day.format('YYYY-MM-DD')).content
... soup = bs4.BeautifulSoup(page, 'lxml')
... table = soup.find('table' ,{'id':'tabela'})
... df = pd.read_html(str(table))
... df[0].to_csv(day.format('YYYY-MM-DD')+'.csv')