How to extract budget, gross, and metascore from IMDB using Scrapy and BeautifulSoup?

I am starting with the URL below:

http://www.imdb.com/chart/top

The structure of the HTML is confusing. The label I am after appears as plain text like this:

" Metascore: "

I am trying to use a format like this:

movie['metascore'] = self.get_text(soup.find('h4', attrs={'&nbsp':'Metascore'}))

Solution

I'll take a stab at this since it sounds like you're new to scraping. It sounds like what you're actually trying to do is get the budget, gross, and metascore from each of the 250 individual movie pages on IMDB. You're on the right track by mentioning Scrapy, because you do have to crawl to those pages from the initial URL you provided. Scrapy has some excellent documentation, so if you want to use it, I highly recommend you start there.

However, if all you need is to scrape those 250 pages, you're better off just using Beautiful Soup to do the whole job. Simply do a soup.findAll("td", {"class":"titleColumn"}), extract the links, then write a loop where you have Beautiful Soup open each of those pages one at a time. If you're not sure how to do that, again, BS has excellent documentation.
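As a rough sketch of that link-extraction step (not a finished scraper): the snippet below assumes the chart page still marks each title with a td of class titleColumn, as described above. The names extract_movie_links, iter_movie_soups, and SAMPLE_CHART are made up for illustration, and the sample HTML is a hand-written stand-in, not a copy of the real page. It uses find_all, the current spelling of findAll.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

CHART_URL = "http://www.imdb.com/chart/top"


def extract_movie_links(chart_html, base_url=CHART_URL):
    """Return absolute URLs for the first link in every titleColumn cell."""
    soup = BeautifulSoup(chart_html, "html.parser")
    links = []
    for cell in soup.find_all("td", {"class": "titleColumn"}):
        anchor = cell.find("a", href=True)
        if anchor is not None:
            # hrefs on the chart are assumed to be site-relative
            links.append(urljoin(base_url, anchor["href"]))
    return links


def iter_movie_soups(links):
    """Open each movie page one at a time and yield (url, soup) pairs."""
    for url in links:
        with urlopen(url) as response:
            yield url, BeautifulSoup(response.read(), "html.parser")


# Hand-written stand-in for the chart markup (an assumption, not real output).
SAMPLE_CHART = """
<table>
  <tr><td class="titleColumn">1. <a href="/title/tt0111161/">First film</a></td></tr>
  <tr><td class="titleColumn">2. <a href="/title/tt0068646/">Second film</a></td></tr>
  <tr><td class="ratingColumn">not a title cell</td></tr>
</table>
"""

links = extract_movie_links(SAMPLE_CHART)
print(links)
```

For the real thing you would fetch CHART_URL, pass its HTML to extract_movie_links, and then loop over iter_movie_soups(links). The fetch loop is defined here but never called, so the sample runs without touching the network.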

From there, it's just a matter of scraping the relevant data you want during each iteration. For instance, the metascore of each film is inside a <div> with the class star-box-details. Do a .find for that, and then you'll have to use a regular expression to extract the exact piece you want (regular-expressions.info has a great tutorial on regex, and if you really get into regex, you'll probably end up sinking hours into RexEgg).
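Here is a minimal sketch of the .find-plus-regex idea for the metascore. It assumes the score shows up in the div's text as something like "Metascore: NN/100". That text format, the function name extract_metascore, and the sample markup are my assumptions, so check them against the page source you actually get.

```python
import re

from bs4 import BeautifulSoup

# Assumed text format: "Metascore: 80/100" (verify against the real page).
METASCORE_RE = re.compile(r"Metascore:\s*(\d{1,3})\s*/\s*100")


def extract_metascore(page_html):
    """Return the metascore as an int, or None if it cannot be found."""
    soup = BeautifulSoup(page_html, "html.parser")
    box = soup.find("div", {"class": "star-box-details"})
    if box is None:
        return None
    # Flatten the div to plain text so nested tags don't get in the way.
    text = box.get_text(" ", strip=True)
    match = METASCORE_RE.search(text)
    return int(match.group(1)) if match else None


# Hand-written stand-in for a movie page (made-up numbers, not real data).
SAMPLE_PAGE = """
<div class="star-box-details">
  Ratings: <strong>8.1</strong>/10 from 12,345 users
  &nbsp; Metascore: <a href="criticreviews">80/100</a>
</div>
"""

score = extract_metascore(SAMPLE_PAGE)
print(score)
```

Returning None instead of raising keeps the loop over 250 pages going when a film has no metascore or the markup differs from what is assumed here.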

I'm not going to code the whole thing, since you'll learn a lot through the trial and error that comes with attempting to solve things, but hopefully that puts you on the right track. Do note that IMDB forbids scraping, although for small projects I'm sure no one will care. If you want to get serious, the "Does IMDB provide an API?" post has some excellent resources for how to do it via various third-party APIs (and some even directly from IMDB). In your case, the best option might be to simply download the data as text files directly from IMDB. Click on any of the FTP links. The files you'll probably want are business.list.gz and ratings.list.gz. As for the metascore on each movie page, that rating actually comes from Metacritic, so you'll want to go there to pull that data.

Good luck!