I'm new to Scrapy and Python. I have been working to extract data from two websites, and it works really well if I do it directly with Python. I have investigated, and I want to crawl these websites:
Can someone tell me how I can make the second link work?
I see this message:
DEBUG: Crawled (200) <GET http://www.vallenproveedora.com.mx/> (referer: None) ['partial']
but I can't figure out how to solve it.
I would appreciate any help and support. Here is the code:
items.py
from scrapy.item import Item, Field


class CraigslistSampleItem(Item):
    title = Field()
    link = Field()
Test.py (spider folder)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from craigslist_sample.items import CraigslistSampleItem


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["vallenproveedora.com.mx"]
    #start_urls = ["http://www.homedepot.com.mx/webapp/wcs/stores/servlet/SearchDisplay?searchTermScope=&filterTerm=&orderBy=&maxPrice=&showResultsPage=true&langId=-5&beginIndex=0&sType=SimpleSearch&pageSize=&manufacturer=&resultCatEntryType=2&catalogId=10052&pageView=table&minPrice=&urlLangId=-5&storeId=13344&searchTerm=guante"]
    start_urls = ["http://www.vallenproveedora.com.mx/"]

    def parse(self, response):
        titles = response.xpath('//ul/li')
        for titles in titles:
            title = titles.select("a/text()").extract()
            link = titles.select("a/@href").extract()
            print (title, link)
You're seeing ['partial'] in your logs because the server at vallenproveedora.com.mx doesn't set the Content-Length header in its response; run curl -I against the URL to see for yourself. For more detail on the cause of the partial flag, see my answer here.
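As a rough sketch of that situation (this is an illustration of the idea, not Scrapy's actual downloader logic, which is more involved): without a Content-Length header or chunked transfer encoding, the client has no way to verify that it received the complete body, so the response gets tagged as potentially partial.

```python
def explain_partial(headers):
    """Heuristic mirroring the behavior described above: if the server
    omits Content-Length (and the response isn't chunked), the client
    can't confirm it got the whole body, so the response is flagged."""
    h = {k.lower(): v for k, v in headers.items()}
    if "content-length" in h or h.get("transfer-encoding", "").lower() == "chunked":
        return "complete"
    return "partial"

# Hypothetical header sets, for illustration only:
print(explain_partial({"Content-Type": "text/html"}))  # no length -> partial
print(explain_partial({"Content-Length": "1024"}))     # verifiable -> complete
```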
However, you don't actually have to worry about this: the response body is all there, and Scrapy will parse it. The problem you're really encountering is that the XPath //ul/li/a selects no elements on that page. You should look at the page source and modify your selectors accordingly. I would also recommend writing a specific spider for each site, because different sites usually need different selectors.
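To illustrate the selector point with a self-contained example (the markup below is made up, not the real vallenproveedora.com.mx page, and only Python's standard library is used): a selector that assumes the wrong structure silently matches nothing, while one written against the actual source finds the links.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source: the links live in a <div>, not in <ul><li>.
html = """
<html><body>
  <div class="products">
    <a href="/item/1">Item one</a>
    <a href="/item/2">Item two</a>
  </div>
</body></html>
"""
root = ET.fromstring(html)

# The question's selector assumes <ul><li> wrappers, so it matches nothing:
assert root.findall(".//ul/li/a") == []

# A selector written against the actual structure finds the links:
links = [(a.text, a.get("href"))
         for a in root.findall(".//div[@class='products']/a")]
print(links)  # [('Item one', '/item/1'), ('Item two', '/item/2')]
```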