Parsing IMDB with BeautifulSoup


I've extracted the following HTML from IMDB's mobile site using BeautifulSoup, with Python 2.7.

I want separate objects for the episode number '1', the title 'Winter Is Coming', and the IMDB score '8.9', but I can't figure out how to split the episode number from the title.

   <a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
     <span class="text-large">
      1.
      <strong>
       Winter Is Coming
      </strong>
     </span>
     <br/>
     <span class="mobile-sprite tiny-star">
     </span>
     <strong>
      8.9
     </strong>
     17 Apr. 2011
    </a>

Solution

  • Use find to locate the span with the class text-large; that span contains both the episode number and the title.

    Once you have that span, next (an alias for next_element) returns the text node that follows the opening tag, which holds the episode number, and find locates the strong element holding the title.

    html = """
    <a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
         <span class="text-large">
          1.
          <strong>
           Winter Is Coming
          </strong>
         </span>
         <br/>
         <span class="mobile-sprite tiny-star">
         </span>
         <strong>
          8.9
         </strong>
         17 Apr. 2011
        </a>
    """
    
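    from bs4 import BeautifulSoup
    
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, 'html.parser')
    # Match on the CSS class; class_ avoids clashing with the Python keyword.
    span = soup.find('span', class_='text-large')
    # next_element is the text node right after the opening <span> tag.
    ep = span.next_element.strip()
    title = span.find('strong').get_text(strip=True)
    
    print(ep)
    print(title)
    
    > 1.
    > Winter Is Coming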
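
The snippet above stops at the episode label and the title, while the question also asks for the score. The following is a minimal sketch, assuming the markup is exactly as shown in the question, that strips the trailing dot from the number and reads the score from the strong element sitting directly under the link. parse_episode is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
  <span class="text-large">
    1.
    <strong>
      Winter Is Coming
    </strong>
  </span>
  <br/>
  <span class="mobile-sprite tiny-star">
  </span>
  <strong>
    8.9
  </strong>
  17 Apr. 2011
</a>
"""


def parse_episode(link):
    """Return (number, title, rating) for one <a class="btn-full"> element.

    Hypothetical helper; assumes the layout shown in the question.
    """
    span = link.find('span', class_='text-large')
    # stripped_strings yields each text node with whitespace removed,
    # in document order: first the "1." label, then the title.
    parts = list(span.stripped_strings)
    number = int(parts[0].rstrip('.'))
    title = parts[1]
    # recursive=False restricts the search to direct children of the link,
    # so the <strong> inside the title span is not matched.
    rating = float(link.find('strong', recursive=False).get_text(strip=True))
    return number, title, rating


soup = BeautifulSoup(html, 'html.parser')
number, title, rating = parse_episode(soup.find('a', class_='btn-full'))

print(number)
print(title)
print(rating)
```

Converting the number with int and the score with float fails loudly if the page layout changes, which is usually preferable to silently storing the wrong text. If the page lists several episodes, the same helper could be applied to each element returned by soup.find_all('a', class_='btn-full').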