I'm working on a script to extract some strings/data from an HTML document (a Nagios status page, in this case) using this custom class:
## tagLister.py
from sgmllib import SGMLParser

class TAGLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_td(self, attrs):
        CLS = [v for k, v in attrs if k == 'class']
        if CLS:
            self.urls.extend(CLS)
Whenever a <td> tag is found, SGMLParser calls start_td, which looks for the class attribute.
>>> import urllib, tagLister
>>> usock = urllib.urlopen("http://www.someurl.com/test/test_page.html")
>>> parser = tagLister.TAGLister()
>>> parser.feed(usock.read())
>>> for url in parser.urls: print url
>>> ...
The above lists all the values of the class attribute found in <td> tags.
Is there any way to dynamically assign the td bit (in start_td) and class (as the value of k), so that, using optparse, they can be assigned on the fly, like this:
tagLister.py -t td -k class
rather than coding it statically? I intend to [re]use this class for any tag (e.g. <a>, <div> etc.) and the associated attributes (e.g. href, id etc.) from the command line. Any help would be greatly appreciated.
One option is to switch to lxml.html and use XPath. The result of that will already be a list, and since an XPath expression is just a string, it's easier to formulate than playing around with class inheritance:
>>> tag = 'a'
>>> attr = 'href'
>>> xpq = '//{}/@{}'.format(tag, attr)
>>> a = '<a href="test-or-something">hello</a><a>No href here</a><a href="something-else">blah</a>'
>>> import lxml.html
>>> lxml.html.fromstring(a).xpath(xpq)
['test-or-something', 'something-else']
If you have to use the stdlib, then you could do something similar with HTMLParser:
from HTMLParser import HTMLParser

class ListTags(HTMLParser):
    def __init__(self, tag, attr):
        HTMLParser.__init__(self)
        self.tag = tag
        self.attr = attr
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            ad = dict(attrs)
            if self.attr in ad:
                self.matches.append(ad[self.attr])
>>> lt = ListTags('a', 'href')
>>> lt.feed(a)
>>> lt.matches
['test-or-something', 'something-else']
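To get the `tagLister.py -t td -k class` style of invocation from the question, the tag and attribute can be read with optparse and handed to the parser class. The following is only a sketch under a few assumptions: the helper name `list_attr_values` is made up for illustration, the defaults (`td` and `class`) are taken from the original example, and the import fallback is there only so the same file also runs on Python 3, where the module is called `html.parser`.

```python
from optparse import OptionParser

try:                                   # Python 2, as used above
    from HTMLParser import HTMLParser
except ImportError:                    # Python 3 renamed the module
    from html.parser import HTMLParser


class ListTags(HTMLParser):
    """Collect the values of one attribute from every matching start tag."""

    def __init__(self, tag, attr):
        HTMLParser.__init__(self)
        self.tag = tag
        self.attr = attr
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            ad = dict(attrs)
            if self.attr in ad:
                self.matches.append(ad[self.attr])


def list_attr_values(html, argv):
    """Hypothetical helper: parse -t/-k from argv and return the matches."""
    op = OptionParser(usage="%prog -t TAG -k ATTR")
    op.add_option("-t", "--tag", dest="tag", default="td",
                  help="tag name to look for (assumed default: td)")
    op.add_option("-k", "--key", dest="key", default="class",
                  help="attribute whose values are listed (assumed default: class)")
    opts, _args = op.parse_args(argv)
    # HTMLParser reports tag and attribute names in lower case,
    # so normalise whatever was typed on the command line.
    lister = ListTags(opts.tag.lower(), opts.key.lower())
    lister.feed(html)
    return lister.matches
```

In a script you would call something like `list_attr_values(usock.read(), sys.argv[1:])` and print each returned value. Passing the tag and attribute to the constructor keeps a single class reusable for `<a>`/`href`, `<div>`/`id` and so on, without defining a separate `start_*` method per tag.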