
Scraping values from HTML header and saving as a CSV file in Python


All,

I've just started using Python (v 2.7.1) and one of my first programs is trying to scrape information from a website containing power station data using the Standard Library and BeautifulSoup to handle the HTML elements.

The data I'd like to access is available either in the 'Head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.

Using a couple of sources on this website I've managed to cobble together the code below, which pulls the data out and saves it to a file, but the output still contains the \n designators. Try as I might, I can't get a correct CSV file to save out.

I am sure it's something simple but need a bit of help if possible!

from BeautifulSoup import BeautifulSoup

import urllib2,string,csv,sys,os
from string import replace

bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'

data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))

data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()

file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'

file = open(file_name,"wb")
file.write(data)
file.close()

Solution

  • Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!

    Try starting like this:

    scripttag = soup.head.findAll("script")[1]
    javascriptdata = scripttag.contents[0]
    

    Then you can use:

    1. partition('=')[2] to cut off the "var gs_csv" bit.
    2. strip(' \n"') to remove unwanted characters at each end (space, newline, ")
    3. replace("\\n","\n") to sort out the new lines.
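
The three steps above can be sketched as follows. This is only an illustration: the sample string is a made-up stand-in for scripttag.contents[0] (the real page's script text may differ, for example if it ends with a semicolon), and extract_csv_text is a hypothetical helper name. It uses only string methods, so it should behave the same on Python 2.7 and Python 3.

```python
# Hypothetical stand-in for scripttag.contents[0]; the real script text may differ.
# Inside the quotes, "\\n" is a literal backslash followed by "n", as in the page source.
javascriptdata = ' var gs_csv="HDR,PHYBM DATA\\nPN,T_COTPS-4,1,100.0\\nFTR,1"\n '


def extract_csv_text(js):
    """Apply the three steps: partition, strip, replace."""
    value = js.partition('=')[2]       # 1. drop everything up to the first '='
    value = value.strip(' \n"')        # 2. trim spaces, newlines and quotes at both ends
    return value.replace('\\n', '\n')  # 3. turn literal \n sequences into real newlines


csv_text = extract_csv_text(javascriptdata)
print(csv_text)
```

If the real script ends with something other than a quote, space or newline, the characters passed to strip would need adjusting.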

    Incidentally, replace is a string method, so you don't have to import it separately; you can just call data.replace(...).

    Finally, you need to separate it as csv. You could save it and reopen it, then load it into a csv.reader. You could use the StringIO module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:

    for line in data.splitlines():
        row = line.split(",")
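
For comparison, here is a minimal sketch of the csv.reader route mentioned above, which avoids saving a file first. It assumes Python 3, where StringIO lives in the io module (on Python 2.7 the StringIO module plays that role). The sample text and the helper names parse_csv_text and write_rows are made up for illustration.

```python
import csv
import io


def parse_csv_text(csv_text):
    """Parse CSV text in memory and return a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(csv_text)))


def write_rows(rows, handle):
    """Write rows to an already-open text file (or any file-like object)."""
    csv.writer(handle).writerows(rows)


# Made-up sample; the real data comes from the cleaned-up script text.
sample = 'HDR,PHYBM DATA\nPN,T_COTPS-4,1,100.0\nFTR,1'
rows = parse_csv_text(sample)
for row in rows:
    print(row)
```

To save the result on Python 3, the file would typically be opened with open(path, 'w', newline='') before being passed to write_rows, so that the csv module controls the line endings itself.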