Alright, I am new to programming, and I figured the best way to learn would be to program something. Part of my job involves searching for a movie on IMDB and pasting the director, writer, (first four) actors, and a link to the IMDB page in an Excel spreadsheet.
My end goal is to have a CSV with the film title and year, and have the scraper take these variables from the CSV, search IMDB, pull the data, and export the data into a new CSV.
I have been reading and researching for about a week. I have gone through the Scrapy tutorial successfully, but I'm having trouble going from there to the desired end.
How can I import values from a CSV into my spider script? I am thinking it would look something like this:
name = COLUMN1
year = COLUMN2
class imdb_spider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["imdb.com"]
    start_urls = [
        "http://www.imdb.com/find?ref_=nv_sr_fn&q=/(name)&(year)"
    ]
I am not sure how to pull from a CSV file though.
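From the csv module docs, something like this seems like it should work for the reading part (a sketch only: movies.csv, the title/year column order, and the function name are all my own made-up choices, and it's written for Python 3, where the quote function lives in urllib.parse instead of urllib):

```python
import csv
from urllib.parse import quote_plus  # Python 2.7: from urllib import quote_plus

def build_search_urls(csv_path):
    """Read (title, year) rows from a CSV and build an IMDB search URL for each."""
    urls = []
    with open(csv_path, newline='') as f:  # Python 2.7: open(csv_path, 'rb')
        for title, year in csv.reader(f):
            query = quote_plus('%s %s' % (title, year))
            urls.append('http://www.imdb.com/find?ref_=nv_sr_fn&q=' + query)
    return urls
```

The resulting list could then be assigned to start_urls inside the spider, so each search-results page gets fed to the parser.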
All the information I need would be on this last page: http://www.imdb.com/title/tt0081505/fullcredits?ref_=tt_ov_st_sm
Here is what I pulled using Firebug:
Director:
<td class="name">
<a href="/name/nm0000040/?ref_=ttfc_fc_dr1"> Stanley Kubrick </a>
</td>
Writer:
<td class="name">
<a href="/name/nm0000040/?ref_=ttfc_fc_wr2"> Stanley Kubrick </a>
</td>
Actors (only need first four, if possible):
<td class="itemprop" itemtype="http://schema.org/Person" itemscope="" itemprop="actor">
<td class="ellipsis"> ... </td>
I am not sure how to define the page link itself.
After that, I just need to loop it over the whole list and save a new CSV with the data.
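From the docs, Scrapy's feed exports can apparently handle the output step on their own (scrapy crawl imdb -o output.csv). Writing the new CSV by hand with the csv module would look something like this (the row contents here are made-up placeholders, and the filename is my own choice):

```python
import csv

# Placeholder rows in the shape (title, year, director, writer, actors, link).
rows = [
    ('The Shining', '1980', 'Stanley Kubrick', 'Stephen King',
     'Jack Nicholson; Shelley Duvall; Danny Lloyd; Scatman Crothers',
     'http://www.imdb.com/title/tt0081505/'),
]

with open('output.csv', 'w', newline='') as f:  # Python 2.7: open('output.csv', 'wb')
    writer = csv.writer(f)
    writer.writerow(['title', 'year', 'director', 'writer', 'actors', 'link'])
    writer.writerows(rows)
```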
I know this is an intense question, and I'm not asking anyone to code it for me. I'm willing to put in the work if I know where to look/how to figure this out. I am reading through the Scrapy documentation, but it is still unclear.
If there is an obviously better way to do this than Python and Scrapy, let me know.
Thanks.
Edit: Mac OS X 10.10.1, Python 2.7, Scrapy 0.24.4, TextWrangler to edit
The csv module is quite handy, and it is also useful for tab-separated files that have irregular/empty fields. (import csv)
import csv

with open('something_something_darkside.txt', 'rb') as f:
    data = list(csv.reader(f, delimiter='\t'))

for row in data:
    print row  # each row comes back as a list of field strings
As far as webpages go, I found methods of using Beautiful Soup to turn the HTML into a well-formed tree and extract what I needed. These methods may be dated but are still reliable.
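A minimal sketch of that approach, assuming the bs4 package is installed and reusing the director cell from the snippets above as a literal string:

```python
from bs4 import BeautifulSoup

# The director cell quoted in the question, fed in directly.
html = '<td class="name"><a href="/name/nm0000040/?ref_=ttfc_fc_dr1"> Stanley Kubrick </a></td>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first tag with that class; get_text() collects the link text.
name = soup.find('td', class_='name').a.get_text().strip()
print(name)
```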