I am relatively new to Python and Scrapy. I'm trying to scrape the links in "Customers who bought this item also bought". For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages for "Customers who bought this item also bought". If I ask Scrapy to scrape that URL, it only scrapes the first page (6 items). How do I ask Scrapy to press the "Next" button to scrape all the items in the 17 pages? Sample code (just the part that matters in crawler.py) would be greatly appreciated. Thank you for your time!
OK, here is my code. As I said, I am new to Python, so the code might look quite stupid, but it works to scrape the first page (6 items). I work mostly with Fortran or Matlab. I would love to learn Python systematically if I have time, though.
# Code of my crawler.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from beta.items import BetaItem


class AlphaSpider(CrawlSpider):
    name = 'alpha'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s/ref=lp_4366_nr_p_n_publication_date_0?rh=n%3A283155%2Cn%3A%211000%2Cn%3A4366%2Cp_n_publication_date%3A1250226011&bbn=4366&ie=UTF8&qid=1384729756&rnid=1250225011']
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//h3/a',)), callback='parse_item'), )

    def parse_item(self, response):
        sel = Selector(response)
        stuff = BetaItem()
        isbn10R = sel.xpath('//li[b[contains(text(),"ISBN-10:")]]/text()').extract()
        isbn10 = []
        if len(isbn10R) > 0:
            isbn10 = [(isbn10R[0].split(' '))[1]]
        stuff['isbn10'] = isbn10
        starsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/span/@title').extract()
        stars = []
        if len(starsR) > 0:
            stars = [(starsR[0].split(' '))[0]]
        stuff['stars'] = stars
        reviewsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/a[contains(@href,"showViewpoints=1")]/text()').extract()
        reviews = []
        if len(reviewsR) > 0:
            reviews = [(reviewsR[0].split(' '))[0]]
        stuff['reviews'] = reviews
        copsR = sel.xpath('//a[@class="sim-img-title"]/@href').extract()
        ncops = len(copsR)
        cops = [None] * ncops
        if ncops > 0:
            for idx, cop in enumerate(copsR):
                cops[idx]=((cop.split('dp/'))[1].split('/ref'))[0]
        stuff['cops'] = cops
        return stuff
So I understand you were able to scrape these "Customers Who Bought This Item Also Bought" product details. As you probably saw, these are within a ul in a div with class "shoveler-content":
<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
<a class="back-button" onclick="return false;" style="" href="#Back">
<div class="shoveler-content">
<ul tabindex="-1">
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
<div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
...
</div>
</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
</ul>
</div>
<a class="next-button" onclick="return false;" style="" href="#Next">
<span class="auiTestSprite s_shvlNext">...</span>
</a>
</div>
</div>
When you inspect network activity in your browser of choice (via Firebug or Chrome's Inspect tool) and click the "next" button to load more suggested products, you'll see an AJAX query to this sort of URL:
http://www.amazon.com
/gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
&pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
&shovelerName=purchase
(I'm using this product page: http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)
What's in the id query argument is a list of ASINs, which are the next suggested products. Why 12 ASINs for 6 displayed? Probably some in-page caching for the next "next" click a user is likely to make.
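To make the shape of that request concrete, here is a minimal sketch that rebuilds the URL from a list of ASINs using only the standard library. The helper name build_shoveler_url is hypothetical, and the endpoint and parameter values are simply copied from the request observed above, so they are assumptions that may not hold for other product categories or for later versions of the site.

```python
from urllib.parse import urlencode

# Endpoint copied from the AJAX request observed in the browser (assumption:
# it stays the same for other products in this category).
SHOVELER_URL = (
    "http://www.amazon.com/gp/product/features/similarities/"
    "shoveler/cell-render.html/ref=pd_sim_kstore"
)


def build_shoveler_url(asins, pos, ref_tag="pd_sim_kstore",
                       wdg="ebooks_display_on_website",
                       shoveler_name="purchase"):
    """Return the AJAX URL for one batch of ASINs (hypothetical helper)."""
    query = urlencode(
        {
            "id": ",".join(asins),
            "pos": pos,
            "refTag": ref_tag,
            "wdg": wdg,
            "shovelerName": shoveler_name,
        },
        safe=",",  # keep commas unescaped, as in the observed request
    )
    return SHOVELER_URL + "?" + query
```

For example, build_shoveler_url(["B00261OOWQ", "B003XQEVUI"], 7) yields a URL whose query string starts with id=B00261OOWQ,B003XQEVUI&pos=7.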
What do you get back from this AJAX query? Still within your browser's inspect tool, you'll see the response is of type application/json, and the response data is a JSON array of 12 elements, each element being an HTML snippet similar to:
<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
<a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title" >
<div class="product-image">
<img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
</div> Home Game: An Accidental Guide to Fatherhood
</a>
<div class="byline">
<span class="carat">›</span>
<a href="http://www.amazon.com/Michael-Lewis/e/B000APZ33E/ref=pd_sim_kstore_bl_7">Michael Lewis</a>
</div>
<div class="rating-price">
<span class="rating-stars">
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" name="B00261OOWQ">
<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7">
<span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars" >
<span>4.1 out of 5 stars</span>
</span>
</a>
</span>
(<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_txt_7">99</a>)
</span>
</span>
</div>
<div class="binding-platform"> Kindle Edition </div>
<div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div>
</div>
So you basically get the same markup that was in the original page's suggested-products section, in each <li> of <div class="shoveler-content"><ul>.
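As a rough illustration of handling such a response, the sketch below decodes the JSON array and pulls the ASIN and product link out of each snippet with the standard library's HTMLParser. The names FaceoutParser and parse_shoveler_response are hypothetical, and the code assumes the snippets keep the data-asin attribute and the sim-img-title class shown above. Inside a spider you would more likely feed each snippet to Scrapy's Selector(text=snippet) and reuse your existing XPath expressions.

```python
import json
from html.parser import HTMLParser


class FaceoutParser(HTMLParser):
    """Collect the ASIN and product link from one suggested-product snippet."""

    def __init__(self):
        super().__init__()
        self.asin = None
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The outer <div> carries the ASIN in its data-asin attribute.
        if tag == "div" and "data-asin" in attrs and self.asin is None:
            self.asin = attrs["data-asin"]
        # The product link is the <a> with class "sim-img-title".
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "sim-img-title" in classes and self.href is None:
            self.href = attrs.get("href")


def parse_shoveler_response(body):
    """Decode the JSON array and return one dict per HTML snippet."""
    products = []
    for snippet in json.loads(body):
        parser = FaceoutParser()
        parser.feed(snippet)
        parser.close()
        products.append({"asin": parser.asin, "href": parser.href})
    return products
```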
But how do you get those ASIN codes to append to the AJAX query's id parameter?
Well, in the product page, you'll notice this section:
<div id="purchaseSimsData"
class="sims-data" style="display:none"
data-baseAsin="B005CRQ2OE" data-featureId="pd_sim"
data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore"
data-wdg="ebooks_display_on_website" data-widgetName="purchase">
B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>
which looks like the ASINs of all the suggested products.
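A small sketch of turning that block of text into a Python list follows. The helper name split_asins is hypothetical. In the spider, the raw text could probably be obtained with an XPath along the lines of //div[@id="purchaseSimsData"]/text(), which is an assumption based on the markup above rather than something tested against the live page.

```python
def split_asins(raw_text):
    """Turn the comma-separated text of the sims-data div into a clean list.

    Whitespace and line breaks around each ASIN are stripped, and empty
    entries (for example after a trailing comma) are dropped.
    """
    return [asin.strip() for asin in raw_text.split(",") if asin.strip()]
```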
Therefore, I suggest you emulate successive AJAX queries to get the suggested products, 12 ASINs at a time, decode each response using the json package, and then parse each HTML snippet to extract the product info you want.
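The batching step could look like the sketch below. The generator name iter_asin_batches is hypothetical, and two assumptions are baked in: the first 6 ASINs are already rendered in the product page, and pos is the 1-based position of the first ASIN in each batch. The second assumption is inferred from the observed request, where pos=7 accompanied a batch starting with the 7th ASIN of the full list.

```python
def iter_asin_batches(asins, already_shown=6, batch_size=12):
    """Yield (pos, batch) pairs for the ASINs not yet rendered in the page.

    pos is the assumed 1-based position of the first ASIN in the batch.
    """
    for start in range(already_shown, len(asins), batch_size):
        yield start + 1, asins[start:start + batch_size]
```

Each (pos, batch) pair then maps onto one AJAX request: join the batch with commas for the id parameter and pass pos along. In a Scrapy spider, each resulting URL would typically be wrapped in a Request whose callback decodes the JSON body and parses the snippets.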