
Python/lxml/xpath for parsing Yahoo Finance


EDIT: I have provided the EXACT source code I'm using to try to figure out this issue.

I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml. An example of a page I'm trying to extract this information from is http://finance.yahoo.com/q/bs?s=FAST+Balance+Sheet&annual .

I've already successfully extracted the data on "total assets" from Smartmoney. An example of a Smartmoney page I'm able to parse is http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView .

Here is a special test script I set up to work on this issue:

    import urllib
    import lxml.html

    # Smartmoney: the label sits in <th><div>, the values in sibling <td> cells.
    url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
    result1 = urllib.urlopen(url_local1)
    element_html1 = result1.read()
    doc1 = lxml.html.document_fromstring(element_html1)
    list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
    print list_row1

    # Yahoo Finance: the label and the values are both wrapped in <strong>.
    url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
    result2 = urllib.urlopen(url_local2)
    element_html2 = result2.read()
    doc2 = lxml.html.document_fromstring(element_html2)
    list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
    print list_row2

I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.

The source code of the table row on the Smartmoney page is:

    <tr class="odd bold">
        <th><div style='font-weight:bold'>Total Assets</div></th>
        <td>  1,684,948</td>
        <td>  1,468,283</td>
        <td>  1,327,358</td>
        <td>  1,304,149</td>
        <td>  1,163,061</td>
    </tr>

The source code of the table row on the Yahoo page is:

    <tr>
        <td colspan="2"><strong>Total Assets</strong></td>
        <td align="right"><strong>1,684,948&nbsp;&nbsp;</strong></td>
        <td align="right"><strong>1,468,283&nbsp;&nbsp;</strong></td>
        <td align="right"><strong>1,327,358&nbsp;&nbsp;</strong></td>
    </tr>

Solution

  • Your query contains syntax errors: it should end with td/strong/text(), and it has a trailing ]. I'd say the correct query is:

    xpath('//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
    

    Result:

    >>> tree.xpath('//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
    [u'1,684,948\xa0\xa0', u'1,468,283\xa0\xa0', u'1,327,358\xa0\xa0']
    

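    In the original page, the "Total Assets" <strong> element contains extra whitespace and line breaks, so an exact text()="Total Assets" comparison matches nothing and the query returns an empty list. Wrapping text() in the XPath normalize-space() function strips leading and trailing whitespace and collapses internal runs before the comparison: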

    xpath('//td[strong[normalize-space(text())="Total Assets"]]/following-sibling::td/strong/text()')
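To see the difference between the two queries without depending on the live page, here is a minimal sketch that runs both against an inline snippet. The snippet is a hypothetical reconstruction of the Yahoo row with line breaks added around the label, and the names `SAMPLE_HTML` and `total_assets` are made up for illustration. It also strips the trailing non-breaking spaces and the thousands separators so the values come back as integers.

```python
import lxml.html

# Hypothetical stand-in for the Yahoo row: the label is surrounded by
# whitespace and line breaks, as the answer describes for the live page.
SAMPLE_HTML = u"""
<table>
  <tr>
    <td colspan="2"><strong>
        Total Assets
    </strong></td>
    <td align="right"><strong>1,684,948&nbsp;&nbsp;</strong></td>
    <td align="right"><strong>1,468,283&nbsp;&nbsp;</strong></td>
    <td align="right"><strong>1,327,358&nbsp;&nbsp;</strong></td>
  </tr>
</table>
"""

# Exact comparison: fails when the label carries surrounding whitespace.
EXACT = '//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()'
# normalize-space() trims and collapses whitespace before comparing.
NORMALIZED = ('//td[strong[normalize-space(text())="Total Assets"]]'
              '/following-sibling::td/strong/text()')


def total_assets(html, query):
    """Run the query and convert each matched cell to an integer."""
    tree = lxml.html.document_fromstring(html)
    raw_values = tree.xpath(query)
    # Drop non-breaking spaces (from &nbsp;) and thousands separators.
    return [int(v.replace(u'\xa0', u'').replace(u',', u'').strip())
            for v in raw_values]


exact_result = total_assets(SAMPLE_HTML, EXACT)
normalized_result = total_assets(SAMPLE_HTML, NORMALIZED)
print(exact_result)
print(normalized_result)
```

With this snippet the exact-match query finds nothing, while the normalize-space version returns the three values. If the real page's markup differs from this reconstruction, the query may need adjusting.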