I'm trying to extract some text from a poorly designed web page for a project. After a lot of research and learning Python I have come close, but I can't find the right regular expression to do it.
Here is what I have so far. From the source code of http://coj.uci.cu/24h/status.xhtml?username=Diego1149&abb=1006 I want to get the whole row of the first accepted problem. So I came up with this:
exprespatFinderTitle = re.compile('<table id="submission" class="volume">.*(<tr class=.*>.*<label class="AC">.*Accepted.*</label>.*</tr>).*</table>')
but what this does is clip everything up to the last <tr> of the table. Can someone help me figure this out?
I'm using Python 2.7 with BeautifulSoup and urllib.
Stick to BeautifulSoup alone; regular expressions are not the right tool for parsing HTML:
table = soup.find('table', id='submission')
accepted = table.tbody.find('label', class_='AC')
if accepted:
    row = accepted.parent.parent  # the <tr> containing the accepted column
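
As for why the pattern lands on the last row: `.*` is greedy, so the leading `.*` swallows as much of the table as it can and the group only starts at the last `<tr>` that still has an accepted label after it. Here is a minimal sketch of that behaviour using only the `re` module. The `html` string is a made-up stand-in for the table, not the real page's markup, and the second pattern is just one possible workaround that shows how fiddly a regex fix gets; it is not a recommendation over the BeautifulSoup approach.

```python
import re

# Made-up stand-in for the submissions table; the real markup may differ.
html = (
    '<table id="submission" class="volume">'
    '<tr class="odd"><td>1</td><td><label class="WA">Wrong Answer</label></td></tr>'
    '<tr class="even"><td>2</td><td><label class="AC">Accepted</label></td></tr>'
    '<tr class="odd"><td>3</td><td><label class="AC">Accepted</label></td></tr>'
    '</table>'
)

# The pattern from the question. The greedy .* right after the <table> tag
# consumes as much as possible, so the captured group starts at the last
# <tr> that is still followed by an AC label.
greedy = re.compile(
    '<table id="submission" class="volume">.*'
    '(<tr class=.*>.*<label class="AC">.*Accepted.*</label>.*</tr>).*</table>'
)
greedy_row = greedy.search(html).group(1)

# One possible workaround (an illustration only): forbid the match from
# crossing a </tr> before it reaches the AC label, and stop at the first
# </tr> afterwards. re.DOTALL lets . match newlines in multi-line HTML.
first_accepted = re.compile(
    r'<tr class=[^>]*>(?:(?!</tr>).)*?<label class="AC">.*?</tr>',
    re.DOTALL,
)
first_row = first_accepted.search(html).group(0)
```

On this sample, `greedy_row` is the third row while `first_row` is the second one, i.e. the first accepted submission. The workaround still breaks as soon as the markup changes (nested tables, a different attribute order), which is why a real parser is the safer choice.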