I have the following html code:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<table border="1" cellspacing="0" class="Quote xaltrow" id="MainContent_Quote1_Table1_Table1" style="border-collapse:collapse;border-collapse:collapse;">
<thead>
<tr class="xheader">
<th>
<span>Sym</span> <a class="ToggleNames" href="/Analytics/MostActive.aspx">-Names</a> <span class="arrow"></span>
</th>
<th colspan="3">Bid - Ask</th>
<th>Last <span class="arrow"></span></th>
<th>Chg <span class="arrow"></span></th>
<th>%Ch <span class="arrow"></span></th>
<th>Vol <span class="arrow"></span></th>
<th>$Vol <span class="arrow"></span></th>
<th>#Tr <span class="arrow"></span></th>
<th>Open-Hi-Lo</th>
<th>Year Hi-Lo</th>
<th>Last Tr</th>
<th>News</th>
<th>Delay</th>
</tr>
</thead>
<tbody>
<tr class="Upd UpdURHHBY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=RHHBY&region=U">RHHBY</a> <span>- Q</span> <span class="Name">- ROCHE HLDG LTD SPONS</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">31.92</td>
<td class=" xred UpdC">-0.31</td>
<td class="xsmall UpdCP xred">-1.0</td>
<td class="q-regright UpdV">851.0</td>
<td class="q-smright UpdW">27,163</td>
<td class="xsmall UpdT">1,461</td>
<td class="xsmall xcentre"><span class="UpdO">32.03</span> <span class="UpdH">32.067</span> <span class="UpdI">31.84</span></td>
<td class="xsmall xcentre">33.74 27.09</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
<tr class="Upd UpdUNSRGY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=NSRGY&region=U">NSRGY</a> <span>- Q</span> <span class="Name">- NESTLE SA REG SHRS S</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">76.07</td>
<td class=" xred UpdC">-0.23</td>
<td class="xsmall UpdCP xred">-0.3</td>
<td class="q-regright UpdV">336.2</td>
<td class="q-smright UpdW">25,574</td>
<td class="xsmall UpdT">1,785</td>
<td class="xsmall xcentre"><span class="UpdO">75.89</span> <span class="UpdH">76.07</span> <span class="UpdI">75.66</span></td>
<td class="xsmall xcentre">83.00 66.28</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
<tr class="Upd UpdUNTTYY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=NTTYY&region=U">NTTYY</a> <span>- Q</span> <span class="Name">- NIPPON TELEGRAPH AND TELEPHONE C</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">43.90</td>
<td class=" xred UpdC">-0.56</td>
<td class="xsmall UpdCP xred">-1.3</td>
<td class="q-regright UpdV">316.2</td>
<td class="q-smright UpdW">13,883</td>
<td class="xsmall UpdT">889</td>
<td class="xsmall xcentre"><span class="UpdO">44.145</span> <span class="UpdH">44.15</span> <span class="UpdI">43.89</span></td>
<td class="xsmall xcentre">44.57 43.00</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
<tr class="Upd UpdUTCEHY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=TCEHY&region=U">TCEHY</a> <span>- Q</span> <span class="Name">- TENCENT HOLDINGS ADR</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">29.63</td>
<td class=" xgreen UpdC">+0.06</td>
<td class="xsmall UpdCP xgreen">0.2</td>
<td class="q-regright UpdV">380.1</td>
<td class="q-smright UpdW">11,263</td>
<td class="xsmall UpdT">1,341</td>
<td class="xsmall xcentre"><span class="UpdO">29.65</span> <span class="UpdH">29.78</span> <span class="UpdI">29.60</span></td>
<td class="xsmall xcentre">29.85 19.74</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
<tr class="Upd UpdUATLKY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=ATLKY&region=U">ATLKY</a> <span>- Q</span> <span class="Name">- ATLAS COPCO AB SER A</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">35.46</td>
<td class=" xred UpdC">-0.23</td>
<td class="xsmall UpdCP xred">-0.6</td>
<td class="q-regright UpdV">316.2</td>
<td class="q-smright UpdW">11,213</td>
<td class="xsmall UpdT">209</td>
<td class="xsmall xcentre"><span class="UpdO">35.74</span> <span class="UpdH">35.81</span> <span class="UpdI">35.46</span></td>
<td class="xsmall xcentre">35.72 23.58</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
<tr class="Upd UpdUVLKAY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=VLKAY&region=U">VLKAY</a> <span>- Q</span> <span class="Name">- VOLKSWAGEN A G SPONS</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">29.15</td>
<td class=" xred UpdC">-0.34</td>
<td class="xsmall UpdCP xred">-1.2</td>
<td class="q-regright UpdV">323.6</td>
<td class="q-smright UpdW">9,432</td>
<td class="xsmall UpdT">782</td>
<td class="xsmall xcentre"><span class="UpdO">28.935</span> <span class="UpdH">29.25</span> <span class="UpdI">28.90</span></td>
<td class="xsmall xcentre">33.60 25.88</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
<tr class="Upd UpdUTMICY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=TMICY&region=U">TMICY</a> <span>- Q</span> <span class="Name">- TREND MICRO ADR #</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">42.78</td>
<td class=" xred UpdC">-0.64</td>
<td class="xsmall UpdCP xred">-1.5</td>
<td class="q-regright UpdV">210.6</td>
<td class="q-smright UpdW">9,011</td>
<td class="xsmall UpdT">155</td>
<td class="xsmall xcentre"><span class="UpdO">42.905</span> <span class="UpdH">42.93</span> <span class="UpdI">42.78</span></td>
<td class="xsmall xcentre">44.75 32.04</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
<tr class="Upd UpdUALIOY-">
<td class="sym">
<a class="qn Name" href="/Quote/Detail.aspx?symbol=ALIOY&region=U">ALIOY</a> <span>- Q</span> <span class="Name">- ACTELION LTD</span>
</td>
<td class="bac" colspan="3">no orders</td>
<td class="UpdL">70.66</td>
<td class=" xgreen UpdC">+0.06</td>
<td class="xsmall UpdCP xgreen">0.1</td>
<td class="q-regright UpdV">123.3</td>
<td class="q-smright UpdW">8,715</td>
<td class="xsmall UpdT">56</td>
<td class="xsmall xcentre"><span class="UpdO">70.538</span> <span class="UpdH">70.70</span> <span class="UpdI">70.50</span></td>
<td class="xsmall xcentre">70.89 34.83</td>
<td class="xsmall xcentre UpdE"></td>
<td class="xsmall xcentre"></td>
<td class="xsmall xcentre">realtime</td>
</tr>
</tbody>
</table>
</body>
</html>
This is my code for grabbing the table:
import lxml.html
response = open('test.html')
html2 = response.read()
root = lxml.html.fromstring(html2)
for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
cells = row.xpath('.//td/text()')
print cells
This is the result:
['no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', u'\xa0\xa0', u'\xa0\xa0', u'33.74\xa0\xa027.09', 'realtime']
['no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', u'\xa0\xa0', u'\xa0\xa0', u'83.00\xa0\xa066.28', 'realtime']
['no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', u'\xa0\xa0', u'\xa0\xa0', u'44.57\xa0\xa043.00', 'realtime']
['no orders', '29.63', '+0.06', '0.2', '380.1', '11,263', '1,341', u'\xa0\xa0', u'\xa0\xa0', u'29.85\xa0\xa019.74', 'realtime']
['no orders', '35.46', '-0.23', '-0.6', '316.2', '11,213', '209', u'\xa0\xa0', u'\xa0\xa0', u'35.72\xa0\xa023.58', 'realtime']
['no orders', '29.15', '-0.34', '-1.2', '323.6', '9,432', '782', u'\xa0\xa0', u'\xa0\xa0', u'33.60\xa0\xa025.88', 'realtime']
['no orders', '42.78', '-0.64', '-1.5', '210.6', '9,011', '155', u'\xa0\xa0', u'\xa0\xa0', u'44.75\xa0\xa032.04', 'realtime']
['no orders', '70.66', '+0.06', '0.1', '123.3', '8,715', '56', u'\xa0\xa0', u'\xa0\xa0', u'70.89\xa0\xa034.83', 'realtime']
I would like it to be the following instead:
['RHHBY', 'ROCHE HLDG LTD SPONS', 'no orders', '31.92', '-0.31', '-1.0', '851.0', '27,163', '1,461', '32.03', '32.067', '31.84', '33.74', '27.09', '', '', 'realtime']
['NSRGY', 'NESTLE SA REG SHRS S', 'no orders', '76.07', '-0.23', '-0.3', '336.2', '25,574', '1,785', '75.89', '76.07', '75.66', '83.00', '66.28', '', '', 'realtime']
['NTTYY', 'NIPPON TELEGRAPH AND TELEPHONE C', 'no orders', '43.90', '-0.56', '-1.3', '316.2', '13,883', '889', '44.145', '44.15', '43.89', '44.57', '43.00', 'realtime']
...
How do I get the values inside the <td>
's where there are spans and/or <a>
's?
This table can be very large. I would like it to be a fast script like above.
I am planning on taking this array and writing it to a database or CSV.
Don't know if this is specifically what you want but you could grab the text in the //td/span
and a
elements:
import lxml.html
response = open('test.html')
html2 = response.read()
root = lxml.html.fromstring(html2)
for row in root.xpath('//*[@id="MainContent_Quote1_Table1_Table1"]/tbody/tr'):
cells=[]
cells = row.xpath('.//td/a/text()')
cells = cells + row.xpath('.//td/text()')
cells = cells + row.xpath('.//td/span/text()')
print(cells)
To eliminate the formatting, you could use something like:
print([c.replace('\xa0','') for c in cells])