
Python 3 HTML parser


I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:

curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'

All I have in Python 3 so far is:

import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')

for lines in f.readlines():
    print(lines)

f.close()

Seriously, any suggestions? (Please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html, as I have been learning Python for 1 day and get easily confused.) A simple example would be amazing!


Solution

  • This is based off of larsmans's answer, above.

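    import urllib.request

    f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
    for line in f:
        # The response yields bytes, so the pattern must be bytes too.
        if b'align="center">' in line:
            # Print the line after the match, like awk's getline.
            print(next(f).decode().rstrip())
    f.close()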
    

    Explanation:

    for line in f iterates over the lines in the file-like object f. Python lets you iterate over the lines in a file the same way you would iterate over the items in a list.

    if b'align="center">' in line checks whether the byte sequence 'align="center">' occurs in the current line. The b prefix marks a bytes literal rather than a string. urllib.request.urlopen returns the response as binary data (bytes) rather than as Unicode strings, and an unadorned 'align="center">' would be a Unicode string, which cannot be searched for inside bytes. (That was the source of the TypeError above.)
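    To make the bytes-versus-string point concrete, here is a small sketch that needs no network access. The sample line is made up for illustration and is not taken from the real page.

```python
# A made-up line of raw response data, as bytes (hypothetical sample).
line = b'<td align="center">\n'

# bytes pattern against bytes data: this works.
found = b'align="center">' in line

# str pattern against bytes data: Python 3 raises TypeError.
try:
    'align="center">' in line
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(found, mixed_ok)
```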

    next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line. The decode method (bytes objects, like strings, have methods in Python) converts the binary data to a printable Unicode string, assuming UTF-8 unless told otherwise. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line).
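    If you want to try the same logic without hitting the website, the sketch below wraps it in a function and feeds it an in-memory stand-in for the response. The function name lines_after and the sample HTML are assumptions made up for this example. It also passes a default to next(), so a match on the very last line is skipped instead of raising an error.

```python
import io


def lines_after(stream, marker):
    """Yield the decoded line that follows each line containing marker.

    stream is any file-like object that yields bytes lines, such as the
    object returned by urllib.request.urlopen. (Hypothetical helper.)
    """
    for line in stream:
        if marker in line:
            # A default of None avoids an error if the match is the last line.
            following = next(stream, None)
            if following is not None:
                yield following.decode().rstrip()


# io.BytesIO stands in for the HTTP response; the content is invented.
sample = io.BytesIO(
    b'<td align="center">\n'
    b'192.0.2.1\n'
    b'</td>\n'
)
result = list(lines_after(sample, b'align="center">'))
print(result)
```

    With the real page you would pass the urlopen result in place of sample, assuming the page still has the layout the awk command expects.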