
Python: SGMLParser can't get line number


I wrote a simple class that inherits from SGMLParser. Its purpose is to collect all the links from an HTML page and print the line number where each link is found.

The class looks like this:

from sgmllib import SGMLParser


class HtmlParser(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.links = []

    def start_a(self, attr):
        href = [v for k, v in attr if k == "href"]
        self.links.append(href[0])
        print(self.getpos())

The problem is that getpos() returns (1, 0) for every link. So if I run the following code:

parser = HtmlParser()
parser.feed('''
<!DOCTYPE html>
    <html>
        <head lang="en">
            <meta charset="UTF-8">
            <title></title>
        </head>
        <body>
            <a href="www.foo-bar.com"></a>
            <a href="http://foo.bar.com"></a>
            <a href="www.google.com"></a>
        </body>
    </html>''')
parser.close()
print(parser.links)

The output will be:

(1, 0)
(1, 0)
(1, 0)
['www.foo-bar.com', 'http://foo.bar.com', 'www.google.com']

The question: why can't I get the actual line number for the links?


Solution

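  • You can't get the line number because sgmllib is broken: SGMLParser inherits getpos(), but it does not appear to update the position while it parses, so getpos() keeps reporting the initial (1, 0).

    As an alternative, you can use HTMLParser in a similar fashion (the import below is for Python 2; in Python 3 the module is html.parser, and sgmllib no longer exists):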

    from HTMLParser import HTMLParser
    
    
    class MyHTMLParser(HTMLParser):
        def reset(self):
            HTMLParser.reset(self)
            self.links = []
    
        def handle_starttag(self, tag, attr):
            if tag == 'a':
                href = [v for k, v in attr if k == "href"]
                if href:  # skip <a> tags that have no href attribute
                    self.links.append(href[0])
                    print(self.getpos())
    
    parser = MyHTMLParser()
    parser.feed('''
    <!DOCTYPE html>
        <html>
            <head lang="en">
                <meta charset="UTF-8">
                <title></title>
            </head>
            <body>
                <a href="www.foo-bar.com"></a>
                <a href="http://foo.bar.com"></a>
                <a href="www.google.com"></a>
            </body>
        </html>''')
    parser.close()
    print(parser.links)
    

    This prints the expected (line, offset) positions:

    (9, 12)
    (10, 12)
    (11, 12)
    ['www.foo-bar.com', 'http://foo.bar.com', 'www.google.com']
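
For anyone on Python 3, the same idea can be sketched with html.parser. This is only a minimal sketch: the class name LinkCollector, the (href, line, column) tuple layout, and the sample page are assumptions made for illustration and are not part of the original answer. It stores the position alongside each link instead of printing it, and it skips anchors without an href.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):  # hypothetical name, chosen for this sketch
    def __init__(self):
        super().__init__()
        self.links = []  # (href, line, column) tuples, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:  # ignore <a> tags that carry no href value
            # getpos() reports where the tag's opening "<" was found
            line, col = self.getpos()
            self.links.append((href, line, col))


# Illustrative input; the second anchor has no href and is skipped.
page = """<html>
  <body>
    <a href="www.foo-bar.com"></a>
    <a name="no-href"></a>
    <a href="www.google.com"></a>
  </body>
</html>"""

collector = LinkCollector()
collector.feed(page)
collector.close()
for href, line, col in collector.links:
    print("%s found at line %d, column %d" % (href, line, col))
```

Keeping the position next to the link makes it easier to report or sort the results later than printing inside the handler does.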