I am trying to write a scraper with Scrapy for Python. At this point, I want to get the name of the web page and all the outbound links within the page. The output should be a dictionary like this:
{'link': [u'Link1'], 'title': [u'Page title']}
I have created this code:
from scrapy.spider import Spider
from scrapy import Selector
from socialmedia.items import SocialMediaItem


class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']

    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            item['title'] = link.xpath('text()').extract()
            item['link'] = link.xpath('@href').extract()
            items.append(item)
        yield items
Could anyone help me get this result? I adapted the code from this page, replacing the deprecated functions: http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/
Thank you so much!
Dani
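If I understand correctly, you want to iterate over all of the links and extract each link's URL and text. Yield each item from inside the loop instead of collecting the items in a list and yielding that list once at the end.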
Get all <a> tags via the //a XPath expression and extract text() and @href:
def parse(self, response):
    for link in response.xpath("//a"):
        item = SocialMediaItem()
        item['title'] = link.xpath('text()').extract()
        item['link'] = link.xpath('@href').extract()
        yield item
This yields:
{'link': [u'#mw-navigation'], 'title': [u'navigation']}
{'link': [u'#p-search'], 'title': [u'search']}
...
{'link': [u'/wiki/Internet_forum'], 'title': [u'Internet forums']}
...
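The difference between the question's parse() and the version above comes down to what the generator emits. Here is a minimal sketch in plain Python, with no Scrapy involved, that mimics both variants. The function names and the sample data are hypothetical and exist only for illustration; plain dicts stand in for SocialMediaItem.

```python
def parse_yielding_list(links):
    """Mimics the question's parse(): builds a list, then yields it once."""
    items = []
    for href, text in links:
        items.append({"link": [href], "title": [text]})
    yield items  # the generator produces a single value: the whole list


def parse_yielding_items(links):
    """Mimics the parse() above: yields one dict per link."""
    for href, text in links:
        yield {"link": [href], "title": [text]}


# Hypothetical sample data standing in for the <a> tags on the page.
sample_links = [
    ("#p-search", "search"),
    ("/wiki/Internet_forum", "Internet forums"),
]

as_list = list(parse_yielding_list(sample_links))
as_items = list(parse_yielding_items(sample_links))

print(len(as_list))   # 1 (one list containing both dicts)
print(len(as_items))  # 2 (one dict per link)
```

Scrapy processes each yielded value separately and expects it to be an item or a request, so the second form is the one that produces one output record per link.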
Also, note that there are Link Extractors built into Scrapy:
LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) which will be eventually followed.