Tags: python, html, python-2.7, html-parsing, href

Get return value from HTMLParser class to main class


Here is my current code:

HTMLParser class:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print value

Main class:

import urllib2

html = urllib2.urlopen(url).read()
MyHTMLParser().feed(html)

Any idea how to make "value" available to the main class? Thanks in advance.


Solution

  • Store the information you want to collect on your parser instance:

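    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) tuples; convert it for lookup
            attrs = dict(attrs)
            if tag == "a" and 'href' in attrs:
                self.links.append(attrs['href'])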
    

    Then, after you have fed the HTML into the parser, you can retrieve the links attribute from the instance:

    parser = MyHTMLParser()
    parser.feed(html)
    print parser.links
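
    The same pattern carries over to Python 3, where the module is named html.parser rather than HTMLParser. The following is only a minimal sketch under that assumption; the names LinkCollector and extract_links and the sample markup are hypothetical and not part of the original question. Wrapping the parser in a small function is one way to hand the collected values back to the calling code as an ordinary return value:

```python
# Python 3 sketch (assumption: the original post targets Python 2).
from html.parser import HTMLParser


class LinkCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.links = []  # collected href values live on the instance

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Feed html to a fresh parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup: two anchors with href, one without.
sample = '<p><a href="/a">A</a> <a name="x">no href</a> <a href="http://example.com/b">B</a></p>'
print(extract_links(sample))
```

    Creating a new parser per call keeps results from one document from leaking into the next.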
    

    For parsing HTML, I can heartily recommend you look at BeautifulSoup instead:

    from bs4 import BeautifulSoup
    
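    # Name the parser explicitly so results do not depend on which parsers are installed
    soup = BeautifulSoup(html, 'html.parser')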
    links = [a['href'] for a in soup.find_all('a', href=True)]