
Dynamic web database scraping in Python


Does anyone have any idea what's wrong with the code below? It grabs data for one day only. The webpage, however, is a dynamic web database containing data for a number of years. I need to scrape the data for each month and day within 2013–2016 and store it in a CSV file.

import calendar
import requests
from bs4 import BeautifulSoup

cal = calendar.Calendar()

base_url = 'http://www.pse.pl/index.php?modul=21&id_rap=24&data=2016'
month_url = '&Month='
day_url = '&Day='

for year in range(2015, 2017):
    for month in range(1, 13):
        monthdays = [d for d in cal.itermonthdays(year, month) if d != 0]
        for day in monthdays:
            r = requests.get(base_url + str(year) + month_url + str(month) + day_url + str(day))
            soup = BeautifulSoup(r.text,'lxml')
            findtable = soup.find('table',{'id':'tabela'})
            for i in findtable.findAll('tr'):
                for j in i.findAll('td'):
                    print (j.text)

Solution

  • The main thing I noticed is that the date must be formatted differently in the URL you send to the site: hyphens are required between the elements of the date (year, month, and day). I can also suggest an easier way of moving through the days.

    Arrow is a library for working with dates. In this case I use it to generate a range of days from 1 January 2015 through 3 January 2015 inclusive.

    I also use it for formatting these dates.

    I find each day's table using BeautifulSoup, then use pandas to extract its contents into a dataframe. Finally, I write those dataframes out as CSV files.

    >>> import requests
    >>> import arrow
    >>> import bs4
    >>> from datetime import datetime
    >>> import pandas as pd
    >>> start = arrow.get(datetime(2015,1,1))
    >>> end = arrow.get(datetime(2015,1,3))
    >>> base_url = 'http://www.pse.pl/index.php?modul=21&id_rap=24&data='
    >>> for day in arrow.Arrow.range('day', start, end):
    ...     page = requests.get(base_url+day.format('YYYY-MM-DD')).content
    ...     soup = bs4.BeautifulSoup(page, 'lxml')
    ...     table = soup.find('table', {'id': 'tabela'})
    ...     df = pd.read_html(str(table))
    ...     df[0].to_csv(day.format('YYYY-MM-DD')+'.csv')
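The snippet above covers only three days; the question asks for all of 2013–2016. The same hyphenated-date fix works with just the standard library, if you don't want the Arrow dependency. A minimal sketch (the `daterange` and `build_url` helpers are my own illustrative names, not part of any library; fetching and parsing each URL would proceed exactly as in the answer above):

```python
from datetime import date, timedelta

BASE_URL = 'http://www.pse.pl/index.php?modul=21&id_rap=24&data='

def daterange(start, end):
    """Yield every date from start through end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)

def build_url(day):
    """Format the date with hyphens (YYYY-MM-DD), as the site expects."""
    return BASE_URL + day.isoformat()

days = list(daterange(date(2013, 1, 1), date(2016, 12, 31)))
print(len(days))           # 1461 days: three 365-day years plus leap year 2016
print(build_url(days[0]))  # first URL in the range
```

Each URL from `build_url` can then be passed to `requests.get` and `pd.read_html` in the same loop shape as the Arrow version.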