Search code examples
pythonhtmlcsvweb-scrapinghtml-table

How to scrape all values in html table including spans and a href


I have the following html code:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    <table border="1" cellspacing="0" class="Quote xaltrow" id="MainContent_Quote1_Table1_Table1" style="border-collapse:collapse;border-collapse:collapse;">
        <thead>
            <tr class="xheader">
                <th>
                    <span>Sym</span> <a class="ToggleNames" href="/Analytics/MostActive.aspx">-Names</a> <span class="arrow"></span>
                </th>
                <th colspan="3">Bid - Ask</th>
                <th>Last <span class="arrow"></span></th>
                <th>Chg <span class="arrow"></span></th>
                <th>%Ch <span class="arrow"></span></th>
                <th>Vol <span class="arrow"></span></th>
                <th>$Vol <span class="arrow"></span></th>
                <th>#Tr <span class="arrow"></span></th>
                <th>Open-Hi-Lo</th>
                <th>Year Hi-Lo</th>
                <th>Last Tr</th>
                <th>News</th>
                <th>Delay</th>
            </tr>
        </thead>
        <tbody>
            <tr class="Upd UpdURHHBY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=RHHBY&amp;region=U">RHHBY</a> <span>- Q</span> <span class="Name">- ROCHE HLDG LTD SPONS</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">31.92</td>
                <td class=" xred UpdC">-0.31</td>
                <td class="xsmall UpdCP xred">-1.0</td>
                <td class="q-regright UpdV">851.0</td>
                <td class="q-smright UpdW">27,163</td>
                <td class="xsmall UpdT">1,461</td>
                <td class="xsmall xcentre"><span class="UpdO">32.03</span>&nbsp;&nbsp;<span class="UpdH">32.067</span>&nbsp;&nbsp;<span class="UpdI">31.84</span></td>
                <td class="xsmall xcentre">33.74&nbsp;&nbsp;27.09</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUNSRGY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=NSRGY&amp;region=U">NSRGY</a> <span>- Q</span> <span class="Name">- NESTLE SA REG SHRS S</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">76.07</td>
                <td class=" xred UpdC">-0.23</td>
                <td class="xsmall UpdCP xred">-0.3</td>
                <td class="q-regright UpdV">336.2</td>
                <td class="q-smright UpdW">25,574</td>
                <td class="xsmall UpdT">1,785</td>
                <td class="xsmall xcentre"><span class="UpdO">75.89</span>&nbsp;&nbsp;<span class="UpdH">76.07</span>&nbsp;&nbsp;<span class="UpdI">75.66</span></td>
                <td class="xsmall xcentre">83.00&nbsp;&nbsp;66.28</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUNTTYY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=NTTYY&amp;region=U">NTTYY</a> <span>- Q</span> <span class="Name">- NIPPON TELEGRAPH AND TELEPHONE C</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">43.90</td>
                <td class=" xred UpdC">-0.56</td>
                <td class="xsmall UpdCP xred">-1.3</td>
                <td class="q-regright UpdV">316.2</td>
                <td class="q-smright UpdW">13,883</td>
                <td class="xsmall UpdT">889</td>
                <td class="xsmall xcentre"><span class="UpdO">44.145</span>&nbsp;&nbsp;<span class="UpdH">44.15</span>&nbsp;&nbsp;<span class="UpdI">43.89</span></td>
                <td class="xsmall xcentre">44.57&nbsp;&nbsp;43.00</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUTCEHY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=TCEHY&amp;region=U">TCEHY</a> <span>- Q</span> <span class="Name">- TENCENT HOLDINGS ADR</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">29.63</td>
                <td class=" xgreen UpdC">+0.06</td>
                <td class="xsmall UpdCP xgreen">0.2</td>
                <td class="q-regright UpdV">380.1</td>
                <td class="q-smright UpdW">11,263</td>
                <td class="xsmall UpdT">1,341</td>
                <td class="xsmall xcentre"><span class="UpdO">29.65</span>&nbsp;&nbsp;<span class="UpdH">29.78</span>&nbsp;&nbsp;<span class="UpdI">29.60</span></td>
                <td class="xsmall xcentre">29.85&nbsp;&nbsp;19.74</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUATLKY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=ATLKY&amp;region=U">ATLKY</a> <span>- Q</span> <span class="Name">- ATLAS COPCO AB SER A</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">35.46</td>
                <td class=" xred UpdC">-0.23</td>
                <td class="xsmall UpdCP xred">-0.6</td>
                <td class="q-regright UpdV">316.2</td>
                <td class="q-smright UpdW">11,213</td>
                <td class="xsmall UpdT">209</td>
                <td class="xsmall xcentre"><span class="UpdO">35.74</span>&nbsp;&nbsp;<span class="UpdH">35.81</span>&nbsp;&nbsp;<span class="UpdI">35.46</span></td>
                <td class="xsmall xcentre">35.72&nbsp;&nbsp;23.58</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUVLKAY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=VLKAY&amp;region=U">VLKAY</a> <span>- Q</span> <span class="Name">- VOLKSWAGEN A G SPONS</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">29.15</td>
                <td class=" xred UpdC">-0.34</td>
                <td class="xsmall UpdCP xred">-1.2</td>
                <td class="q-regright UpdV">323.6</td>
                <td class="q-smright UpdW">9,432</td>
                <td class="xsmall UpdT">782</td>
                <td class="xsmall xcentre"><span class="UpdO">28.935</span>&nbsp;&nbsp;<span class="UpdH">29.25</span>&nbsp;&nbsp;<span class="UpdI">28.90</span></td>
                <td class="xsmall xcentre">33.60&nbsp;&nbsp;25.88</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUTMICY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=TMICY&amp;region=U">TMICY</a> <span>- Q</span> <span class="Name">- TREND MICRO ADR #</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">42.78</td>
                <td class=" xred UpdC">-0.64</td>
                <td class="xsmall UpdCP xred">-1.5</td>
                <td class="q-regright UpdV">210.6</td>
                <td class="q-smright UpdW">9,011</td>
                <td class="xsmall UpdT">155</td>
                <td class="xsmall xcentre"><span class="UpdO">42.905</span>&nbsp;&nbsp;<span class="UpdH">42.93</span>&nbsp;&nbsp;<span class="UpdI">42.78</span></td>
                <td class="xsmall xcentre">44.75&nbsp;&nbsp;32.04</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
            <tr class="Upd UpdUALIOY-">
                <td class="sym">
                    <a class="qn Name" href="/Quote/Detail.aspx?symbol=ALIOY&amp;region=U">ALIOY</a> <span>- Q</span> <span class="Name">- ACTELION LTD</span>
                </td>
                <td class="bac" colspan="3">no orders</td>
                <td class="UpdL">70.66</td>
                <td class=" xgreen UpdC">+0.06</td>
                <td class="xsmall UpdCP xgreen">0.1</td>
                <td class="q-regright UpdV">123.3</td>
                <td class="q-smright UpdW">8,715</td>
                <td class="xsmall UpdT">56</td>
                <td class="xsmall xcentre"><span class="UpdO">70.538</span>&nbsp;&nbsp;<span class="UpdH">70.70</span>&nbsp;&nbsp;<span class="UpdI">70.50</span></td>
                <td class="xsmall xcentre">70.89&nbsp;&nbsp;34.83</td>
                <td class="xsmall xcentre UpdE"></td>
                <td class="xsmall xcentre"></td>
                <td class="xsmall xcentre">realtime</td>
            </tr>
        </tbody>
    </table>
</body>
</html>

This is my code for grabbing the table:

import lxml.html

response = open('test.html')
html2 = response.read()
root = lxml.html.fromstring(html2)
for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
    cells = row.xpath('.//td/text()')
    print cells

This is the result:

['no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', u'\xa0\xa0', u'\xa0\xa0', u'33.74\xa0\xa027.09', 'realtime']
['no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', u'\xa0\xa0', u'\xa0\xa0', u'83.00\xa0\xa066.28', 'realtime']
['no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', u'\xa0\xa0', u'\xa0\xa0', u'44.57\xa0\xa043.00', 'realtime']
['no orders', '29.63', '+0.06', '0.2', '380.1', '11,263', '1,341', u'\xa0\xa0', u'\xa0\xa0', u'29.85\xa0\xa019.74', 'realtime']
['no orders', '35.46', '-0.23', '-0.6', '316.2', '11,213', '209', u'\xa0\xa0', u'\xa0\xa0', u'35.72\xa0\xa023.58', 'realtime']
['no orders', '29.15', '-0.34', '-1.2', '323.6', '9,432', '782', u'\xa0\xa0', u'\xa0\xa0', u'33.60\xa0\xa025.88', 'realtime']
['no orders', '42.78', '-0.64', '-1.5', '210.6', '9,011', '155', u'\xa0\xa0', u'\xa0\xa0', u'44.75\xa0\xa032.04', 'realtime']
['no orders', '70.66', '+0.06', '0.1', '123.3', '8,715', '56', u'\xa0\xa0', u'\xa0\xa0', u'70.89\xa0\xa034.83', 'realtime']

I would like it to be the following instead:

['RHHBY', 'ROCHE HLDG LTD SPONS', 'no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', '32.03', '32.067', '31.84', '33.74', '27.09', '', '', 'realtime']
['NSRGY', 'NESTLE SA REG SHRS S', 'no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', '75.89', '76.07', '75.66', '83.00', '66.28', '', '', 'realtime']
['NTTYY', 'NIPPON TELEGRAPH AND TELEPHONE C', 'no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', '44.145', '44.15', '43.89', '44.57', '43.00', 'realtime']
...

How do I get the values inside the <td>'s where there are spans and/or <a>'s?

This table can be very large. I would like it to be a fast script like above.

I am planning on taking this array and writing it to a database or CSV.


Solution

  • Don't know if this is specifically what you want but you could grab the text in the //td/span and a elements:

    import lxml.html
    
    response = open('test.html')
    html2 = response.read()
    root = lxml.html.fromstring(html2)
    for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
        cells=[]
        cells = row.xpath('.//td/a/text()')
        cells = cells + row.xpath('.//td/text()')
        cells = cells + row.xpath('.//td/span/text()')
    
        print(cells)
    

    To eliminate the formatting, you could use something like:

    print([c.replace('\xa0','') for c in cells])