
Extract title from url with python


I want to use urllib to extract the title from the following HTML document. Only the beginning of the document is shown below:

html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
      "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
  <title>Three Little Pigs</title>
  <meta name="generator" content="Amaya, see http://www.w3.org/Amaya/">
</head>

<body>

I used urlopen from urllib.request, but it seems the URL type in the HTML document does not allow me to extract anything.

I have tried:

from bs4 import BeautifulSoup
from urllib.request import urlopen
def get_title():
    soup = urlopen(html_doc)
    print(soup.title.string)
get_title()

I got this error:

ValueError: unknown url type: '!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n      "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\n  <title>Three Little Pigs</title>\n  <meta name="generator" content="Amaya, see http://www.w3.org/Amaya/">\n</head>\n\n<body'

Can anyone help with this problem?


Solution

  • html_doc is not a URL; it is the HTML source itself, held in a string, so urlopen has nothing to fetch and raises ValueError. Pass the string directly to BeautifulSoup with the built-in html.parser, then read the title from the parsed document:

    from bs4 import BeautifulSoup
    
    def get_title():
        soup = BeautifulSoup(html_doc, 'html.parser')
        print(soup.title.string)
    
    get_title()
    

    Output:

    Three Little Pigs
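
If the HTML does come from a real address rather than a string, urlopen is still the right tool for the download step; the markup it returns then needs parsing just as above. The sketch below separates the two steps and, as an assumption for illustration, uses only the standard library's html.parser instead of BeautifulSoup. The names TitleParser, title_from_html, title_from_url and the sample string are hypothetical and not part of the original question or answer.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def title_from_html(html):
    """Return the title of an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def title_from_url(url, fallback_encoding="utf-8"):
    """Download a page with urlopen, then parse the returned markup."""
    # urlopen needs a real URL (http://, https://, file://, ...), not HTML.
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or fallback_encoding
        html = response.read().decode(charset, errors="replace")
    return title_from_html(html)


# Hypothetical stand-in for html_doc from the question.
sample = "<html><head><title>Three Little Pigs</title></head><body></body></html>"
print(title_from_html(sample))  # Three Little Pigs
```

Call title_from_html when you already hold the markup, as with html_doc, and title_from_url only when you have an address to fetch. The same split works with BeautifulSoup: pass the decoded text (or the raw bytes) from urlopen to BeautifulSoup(..., 'html.parser') instead of the custom parser.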