python-3.x xpath web-scraping finance

XPath help needed: find a row by its label text and return a column value on MorningStar


I need to extract some data from HTML that looks like this:

<div class="r_tbar0 positionrelative">
    <h3>Financials</h3>
</div>
<table class="r_table1 text2" cellspacing="0" cellpadding="0">
    <thead>
        <tr>
            <th scope="row" align="left"></th>
            <th scope="col" id="Y0" align="right">2007-12</th>
            <th scope="col" id="Y1" align="right">2008-12</th>
            <!--More columns here-->
            <th scope="col" id="Y9" align="right">2016-12</th>
            <th scope="col" id="Y10" align="right">TTM</th>
        </tr>
    </thead>
    <tbody>
        <tr class="hr">
            <td colspan="12"></td>
        </tr>
        <tr>
            <th class="row_lbl" scope="row" id="i0">Revenue&nbsp;<span>USD Mil</span></th>
            <td headers="Y0 i0" align="right">5,858</td>
            <td headers="Y1 i0" align="right">5,808</td>
            <!--More cells here-->
            <td headers="Y9 i0" align="right">4,272</td>
            <td headers="Y10 i0" align="right">4,955</td>
        </tr>
        <tr class="hr">
            <td colspan="12"></td>
        </tr>
        <tr>
            <th class="row_lbl" scope="row" id="i1">Gross Margin %</th>
            <td headers="Y0 i1" align="right">37.4</td>
            <td headers="Y1 i1" align="right">39.9</td>
            <!--More cells here-->
            <td headers="Y9 i1" align="right">23.4</td>
            <td headers="Y10 i1" align="right">33.5</td>
        </tr>
        <!--More rows here-->
        <tr class="hr">
            <td colspan="12"></td>
        </tr>
    </tbody>
</table>

I want to pull, say, the 2007 revenue figure from the key ratios page with an XPath that finds the "Revenue" row and then reads the 2007 column.

XPath location of the 2007 revenue:

//*[@id="financials"]/table/tbody/tr[2]/td[1]

The tr[2] selects the row that holds Revenue on this page. However, if my program looks at multiple stocks, I want to make sure the expression still picks the Revenue row rather than whatever happens to be in second position.

I have tried multiple versions of the following XPath, all of which return NULL. (I am using XPath Helper, a Google Chrome extension.)

//*[@id="financials"]/table/tbody/tr[contains(text(),'Revenue')]/td[1]

Outer HTML of the Revenue row header:

<th class="row_lbl" scope="row" id="i0">Revenue&nbsp;<span>USD Mil</span></th>

Outer HTML of the 2007 Revenue cell:

<td align="right" headers="Y0 i0" class="">5,858</td>

Update

Based on the answer below, I've written:

//*[@id='financials']//td[contains(@headers,'i0')][1]

This pulls the 2007 Revenue figure, 5,858.


Solution

  • In the "Financials" table, the "Revenue" text sits inside a th, not directly inside the tr, so text() evaluated on the tr never sees it. You can get all cells in a column or row of that table through the headers attribute of the td tags. Columns are Y0..Yn and rows are i0..in, for example:

    The first column has header id Y0:

    //*[@id='financials']//td[contains(@headers,'Y0')]

    The first row has header id i0:

    //*[@id='financials']//td[contains(@headers,'i0')]

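    And so on. Keep in mind that contains() is a substring test, so 'Y1' also matches Y10 and 'i1' also matches i10. For those ids, match the whole token instead, e.g. contains(concat(' ', normalize-space(@headers), ' '), ' Y1 ').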
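
For the multi-stock case in the question, the same headers idea can be driven from Python so that neither the row position nor the row id is hard-coded. The following is only a minimal sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree on a trimmed, well-formed copy of the table from the question (the live page would likely need a real HTML parser), and the helper names header_id and cell_value are made up for illustration. It looks up the row id from the label text and the column id from the year, then picks the td whose headers attribute lists both.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table in the question (assumption: the
# real page is HTML and may not parse as XML without an HTML parser).
SNIPPET = """
<div id="financials">
  <table>
    <thead>
      <tr>
        <th scope="row"></th>
        <th scope="col" id="Y0">2007-12</th>
        <th scope="col" id="Y1">2008-12</th>
        <th scope="col" id="Y10">TTM</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <th class="row_lbl" scope="row" id="i0">Revenue <span>USD Mil</span></th>
        <td headers="Y0 i0">5,858</td>
        <td headers="Y1 i0">5,808</td>
        <td headers="Y10 i0">4,955</td>
      </tr>
      <tr>
        <th class="row_lbl" scope="row" id="i1">Gross Margin %</th>
        <td headers="Y0 i1">37.4</td>
        <td headers="Y1 i1">39.9</td>
        <td headers="Y10 i1">33.5</td>
      </tr>
    </tbody>
  </table>
</div>
"""


def header_id(root, scope, label):
    """Return the id of the first <th> with this scope whose text starts with label."""
    for th in root.iter("th"):
        # itertext() also collects text from nested tags such as <span>.
        text = "".join(th.itertext()).strip()
        if th.get("scope") == scope and th.get("id") and text.startswith(label):
            return th.get("id")
    return None


def cell_value(root, row_label, col_label):
    """Find a cell by row label and column label via the headers attribute."""
    row_id = header_id(root, "row", row_label)
    col_id = header_id(root, "col", col_label)
    if row_id is None or col_id is None:
        return None
    wanted = {row_id, col_id}
    for td in root.iter("td"):
        # Compare whole tokens so that "Y1" does not also match "Y10".
        if wanted <= set(td.get("headers", "").split()):
            return (td.text or "").strip()
    return None


root = ET.fromstring(SNIPPET)
print(cell_value(root, "Revenue", "2007-12"))
print(cell_value(root, "Gross Margin %", "2008-12"))
```

Because the label match is a prefix test, a page that has both "Revenue" and, say, a "Revenue Growth" row would return whichever comes first, so a stricter comparison may be needed there.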