Tags: python, html, regex, parsing, urllib

How to remove HTML tags in Python 3


I am writing a simple script to print my IP address in the terminal. I am having trouble removing the HTML tags from the printed output.

I have tried calling .strip() on each decoded line, but it only removes leading and trailing whitespace, not the tags. I do not understand regex well enough to use it in this code.

import re
import urllib.request, urllib.parse, urllib.error
import json


data = urllib.request.urlopen('http://checkip.dyndns.org')
for line in data:
    print(line.decode().strip())

I expect the output to be only my IP (xxx.xx.xx.xxx), but instead I am getting the following:

"<html><head><title>Current IP Check</title></head><body>Current IP Address: XXX.XX.XX.XXX</body></html>"


Solution

  • If you want to use regex, you can match just the part you are interested in, using parentheses to capture it, instead of stripping the tags. Here's an example:

    import re
    import urllib.request
    
    
    data = urllib.request.urlopen('http://checkip.dyndns.org').read().decode()
    print(re.search(r'Current IP Address: ([\d\.]+)', data).group(1))
    

    You can find more info and examples at https://docs.python.org/3/library/re.html#match-objects

    To remove HTML tags in general, you can use re.sub with a simple pattern like this (it does not handle comments, scripts, or a > inside an attribute value):

    print(re.sub('<[^<]+?>', '', '<html>foo</html>'))
    

    Or, even easier, use BeautifulSoup instead of re:

    from bs4 import BeautifulSoup
    print(BeautifulSoup('<html>foo</html>', 'html.parser').get_text())
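
If you would rather not install anything, here is a rough sketch of the same idea using only the standard library's html.parser module. This is a different technique from the regex and BeautifulSoup versions above, not a drop-in copy of them. The names TextExtractor, strip_tags and extract_ip are made up for this example, and the sample string with the 203.0.113.7 documentation address is a stand-in for the real response, so nothing here touches the network.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # HTMLParser calls this for every run of text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def extract_ip(html):
    """Return the digits-and-dots string after the label, or None if absent."""
    match = re.search(r"Current IP Address: ([\d.]+)", strip_tags(html))
    # re.search returns None when nothing matches, so guard before .group().
    return match.group(1) if match else None


# Assumed sample page, shaped like the output shown in the question.
sample = (
    "<html><head><title>Current IP Check</title></head>"
    "<body>Current IP Address: 203.0.113.7</body></html>"
)

print(strip_tags(sample))
print(extract_ip(sample))
```

To use it with the live page, pass in the decoded response instead of sample, e.g. extract_ip(urllib.request.urlopen('http://checkip.dyndns.org').read().decode()). Checking for None means the script does not crash with an AttributeError if the page format ever changes.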