Search code examples
pythonparsingcsvfinancial

Parse csv to dict


I am trying to parse csv financial data from the web into a dict that I can navigate through by key.
I am failing using csv.DictReader.
I have:

import csv
import urllib2
req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
response = urllib2.urlopen(req)
response.next() 
csvio = (csv.DictReader(response))
print csvio.fieldnames 
print len(csvio.fieldnames)

Edited to reflect changes from answer below.

This almost gets me there, but I need to strip the leading 'Fiscal year...share data' before feeding it to DictReader. How best to do this? I have looked at converting to string and stripping lead chars with str.lstrip() as the docs say here with no luck.


Solution

  • To use a DictReader you need to either specify the field names, or the field names need to be the first row of the csv data (ie. a header row).

    In the csv file that your code retrieves, the field names are in the second row of data, not the first. What I've done is to throw out the first line of data before passing the csv file to the DictReader constructor.

    In response to your updated question: Stripping the leading text from the header row probably isn't desirable as this is acting as the field name for the first column of data. Probably better to discard the first 2 rows of data and then supply the desired field names directly to the DictReader. I have updated the example below to reflect this.

    import csv
    import urllib2
    
    req = urllib2.Request('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:BRCM&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=desc&columnYear=5&rounding=3&view=raw&r=886300&denominatorView=raw&number=3')
    response = urllib2.urlopen(req)
    
    response.readline() # This reads (and discards) the first row of data which is not needed.
    response.readline() # skip the 
    
    myFieldnames = ["firstColName", "TTM", "2012", "2011", "2010", "2009", "2008"]
    csvio = csv.DictReader(response, fieldnames=myFieldnames)
    
    print csvio.fieldnames 
    for row in csvio:
        print row