
Dynamically extracting data from an HTML page


I'm working on a script to extract strings from an HTML document (a Nagios status page, in this case) using this custom class:

## tagLister.py

from sgmllib import SGMLParser
class TAGLister(SGMLParser):

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_td(self, attrs):
        CLS = [ v for k, v in attrs if k == 'class' ]
        if CLS:
            self.urls.extend(CLS)

Whenever a <td> tag is found, SGMLParser calls start_td, which looks for the class attribute.

>>> import urllib, tagLister
>>> usock = urllib.urlopen("http://www.someurl.com/test/test_page.html")
>>> parser = tagLister.TAGLister()
>>> parser.feed(usock.read())  
>>> for url in parser.urls: print url
>>> ...

The above lists every class attribute value found on <td> tags. Is there a way to set the tag (the td in start_td) and the attribute name (class, the value compared against k) dynamically, so that they can be passed in with optparse, like this:

tagLister.py -t td -k class

rather than hard-coding them? I intend to reuse this class from the command line for any tag (e.g. <a>, <div>) and any associated attribute (e.g. href, id). Any help would be greatly appreciated.


Solution

  • One option is to switch to lxml.html and use XPath. The result is already a list, and since an XPath expression is just a string, it is easier to build from a tag and attribute name than playing around with class inheritance:

    >>> tag = 'a'
    >>> attr = 'href'
    >>> xpq = '//{}/@{}'.format(tag, attr)
    >>> a = '<a href="test-or-something">hello</a><a>No href here</a><a href="something-else">blah</a>'
    >>> import lxml.html
    >>> lxml.html.fromstring(a).xpath(xpq)
    ['test-or-something', 'something-else']
    

    If you have to stick to the stdlib, you could do something similar with HTMLParser:

    from HTMLParser import HTMLParser
    
    class ListTags(HTMLParser):
        def __init__(self, tag, attr):
            HTMLParser.__init__(self)
            self.tag = tag
            self.attr = attr
            self.matches = []
        def handle_starttag(self, tag, attrs):
            if tag == self.tag:
                ad = dict(attrs)
                if self.attr in ad:
                    self.matches.append(ad[self.attr])
    
    >>> lt = ListTags('a', 'href')
    >>> lt.feed(a)
    >>> lt.matches
    ['test-or-something', 'something-else']
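
To connect this to the command line as asked in the question, here is a minimal sketch. It assumes Python 3, where HTMLParser is imported from html.parser and sgmllib no longer exists. The extract and main helpers, the defaults for -t and -k, and the optional FILE argument are illustrative assumptions, not part of the original answer.

```python
# Sketch only: assumes Python 3 (html.parser instead of the Python 2 HTMLParser module).
import sys
from html.parser import HTMLParser
from optparse import OptionParser


class ListTags(HTMLParser):
    """Collect the values of one attribute from every occurrence of one tag."""

    def __init__(self, tag, attr):
        HTMLParser.__init__(self)
        # HTMLParser reports tag and attribute names in lower case,
        # so normalise what the caller passed in.
        self.tag = tag.lower()
        self.attr = attr.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            ad = dict(attrs)
            if self.attr in ad:
                self.matches.append(ad[self.attr])


def extract(html, tag, attr):
    """Hypothetical helper: return all `attr` values found on `tag` elements."""
    parser = ListTags(tag, attr)
    parser.feed(html)
    parser.close()
    return parser.matches


def main(argv=None):
    """Hypothetical entry point; argv=None makes optparse use sys.argv[1:]."""
    op = OptionParser(usage="%prog -t TAG -k ATTR [FILE]")
    op.add_option("-t", "--tag", dest="tag", default="td",
                  help="tag to look for (default: %default)")
    op.add_option("-k", "--key", dest="attr", default="class",
                  help="attribute to collect (default: %default)")
    options, args = op.parse_args(argv)

    # Assumption: read HTML from FILE if given, otherwise from standard input.
    if args:
        with open(args[0]) as fh:
            html = fh.read()
    else:
        html = sys.stdin.read()

    for value in extract(html, options.tag, options.attr):
        print(value)

# To use this as a script, call main() at the bottom of the file,
# for example under an `if __name__ == "__main__":` guard.
```

Saved as tagLister.py with that call to main() added, this could be run as `tagLister.py -t td -k class page.html`, where page.html is a placeholder for a saved copy of the status page. Because the tag and attribute are plain constructor arguments, no per-tag start_* method is needed.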