python html html-parsing

How to get page title in requests


What would be the simplest way to get the title of a page in Requests?

import requests

r = requests.get('http://www.imdb.com/title/tt0108778/')
# ? r.title
# Desired result: Friends (TV Series 1994–2004) - IMDb

Solution

  • Requests does not parse HTML itself, so you need an HTML parser to parse the response and get the title tag's text.

    Example using lxml.html:

    >>> import requests
    >>> from lxml.html import fromstring
    >>> r = requests.get('http://www.imdb.com/title/tt0108778/')
    >>> tree = fromstring(r.content)
    >>> tree.findtext('.//title')
    u'Friends (TV Series 1994\u20132004) - IMDb'
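
    If lxml is not installed, the same idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not part of the original answer: the TitleParser class and the extract_title helper are hypothetical names, and the function takes HTML text (for example r.text from the Requests response) so it can be tried without a network call.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while we are between <title> and </title>
        self._done = False       # True once the first <title> has been closed
        self._chunks = []        # text may arrive in several pieces

    def handle_starttag(self, tag, attrs):
        # Tag names are lowercased by HTMLParser before this is called.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Return None when no non-empty title was found.
        return "".join(self._chunks).strip() or None


def extract_title(html_text):
    """Return the page title from an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

    With a response object r as in the examples above, this would presumably be called as extract_title(r.text).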
    

    There are other options as well, for example the mechanize library:

    >>> import mechanize
    >>> br = mechanize.Browser()
    >>> br.open('http://www.imdb.com/title/tt0108778/')
    >>> br.title()
    'Friends (TV Series 1994\xe2\x80\x932004) - IMDb'
    

    Which option to choose depends on what you are going to do next: parse the page to get more data, or perhaps interact with it (click buttons, submit forms, follow links, etc.).
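
    As a rough illustration of the "parse the page to get more data" case, here is a small sketch that collects link targets from an HTML string using only the standard library. LinkCollector and collect_links are hypothetical names chosen for this example, and the input is assumed to be HTML text you have already downloaded.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gathers the href of every <a> tag seen in the document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; skip anchors without an href.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(html_text):
    """Return all non-empty href values of <a> tags, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

    For interaction (forms, clicks, sessions), a browser-like tool such as mechanize is likely the better fit than a plain parser.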

    Besides, you may want to use an API provided by IMDb instead of parsing the HTML yourself.

    Example usage of the IMDbPY package:

    >>> from imdb import IMDb
    >>> ia = IMDb()
    >>> movie = ia.get_movie('0108778')
    >>> movie['title']
    u'Friends'
    >>> movie['series years']
    u'1994-2004'