Tags: python, proxy, scrapy, privoxy

How to use Privoxy and Tor for a Scrapy project


I'm trying to scrape http://www.apkmirror.com, but I'm currently no longer able to access the site in my browser because it says the owner has banned my IP address (see the screenshot below).

[Screenshot of the IP ban message]

I'm trying to get around this by using Privoxy and Tor, similar to what is described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/.

First, I installed and started Privoxy, which by default listens on port 8118. I've added the following line to /etc/privoxy/config:

forward-socks5   /               127.0.0.1:9050 .

I also have Tor running, which is listening on port 9050, as verified using

kurt@kurt-ThinkPad:~$ netstat -tulnp | grep 9050
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:9050          0.0.0.0:*               LISTEN      - 
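
As an extra sanity check, a short script along these lines (using the requests library and Tor's check service, neither of which is part of my project) can confirm that traffic routed through Privoxy really exits via Tor:

import requests

# Send everything through Privoxy, which forwards to Tor over SOCKS5
proxies = {
    'http': 'http://127.0.0.1:8118',
    'https': 'http://127.0.0.1:8118',
}

# check.torproject.org reports whether the request arrived from a Tor exit node
response = requests.get('https://check.torproject.org/api/ip', proxies=proxies, timeout=60)
print(response.json())  # e.g. {'IsTor': True, 'IP': '<exit node address>'}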

As far as I can tell from wget, this setup works. For example, if I fetch www.apkmirror.com through the proxy I get a 200 OK response:

kurt@kurt-ThinkPad:~$ wget www.apkmirror.com -e use_proxy=yes -e http_proxy=127.0.0.1:8118
--2017-04-24 11:02:32--  http://www.apkmirror.com/
Connecting to 127.0.0.1:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 185097 (181K) [text/html]
Saving to: ‘index.html.2’

index.html.2        100%[===================>] 180,76K  --.-KB/s    in 0,004s  

2017-04-24 11:02:44 (42,7 MB/s) - ‘index.html.2’ saved [185097/185097]

whereas without the proxy I get ERROR 403: Forbidden:

kurt@kurt-ThinkPad:~$ wget www.apkmirror.com
--2017-04-24 11:01:24--  http://www.apkmirror.com/
Resolving www.apkmirror.com (www.apkmirror.com)... 104.19.134.58, 104.19.136.58, 104.19.133.58, ...
Connecting to www.apkmirror.com (www.apkmirror.com)|104.19.134.58|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-04-24 11:01:24 ERROR 403: Forbidden.

Now for the Python code. I've written the following (simplified) spider:

import scrapy

DEBUG = True

class TorSpider(scrapy.spiders.SitemapSpider):
    name = "tor-spider"

    sitemap_urls = ['https://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}

    def parse(self, response):
        item = {'url': response.url}
        yield item

I've also added the following lines to settings.py:

import os
os.environ['http_proxy'] = "http://localhost:8118"

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
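
As I understand it, the same proxy could also be set per request via request.meta['proxy'], for example with a small custom downloader middleware along these lines (just a sketch with a hypothetical class name; I haven't added this to my project):

# proxy_spider/middlewares.py (hypothetical)
class PrivoxyProxyMiddleware(object):
    """Assign the Privoxy endpoint to every outgoing request."""

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'

# registered in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {
#     'proxy_spider.middlewares.PrivoxyProxyMiddleware': 100,
# }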

From what I understand of https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy, the HttpProxyMiddleware should work if I set the http_proxy environment variable. However, if I try to scrape using the command

scrapy crawl tor-spider -o test.json

I get the following output:

2017-04-24 10:59:17 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: proxy_spider)
2017-04-24 10:59:17 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proxy_spider.spiders', 'FEED_URI': 'test.json', 'SPIDER_MODULES': ['proxy_spider.spiders'], 'BOT_NAME': 'proxy_spider', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}

2017-04-24 10:59:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-04-24 10:59:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-24 10:59:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-24 10:59:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-24 10:59:18 [scrapy.core.engine] INFO: Spider opened
2017-04-24 10:59:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-24 10:59:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-24 10:59:18 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.apkmirror.com/robots.txt> (referer: None)
2017-04-24 10:59:18 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.apkmirror.com/sitemap_index.xml> (referer: None)
2017-04-24 10:59:18 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.apkmirror.com/sitemap_index.xml>: HTTP status code is not handled or not allowed
2017-04-24 10:59:18 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-24 10:59:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 519,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 3110,
 'downloader/response_count': 2,
 'downloader/response_status_count/403': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 24, 8, 59, 18, 927878),
 'log_count/DEBUG': 3,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 24, 8, 59, 18, 489419)}
2017-04-24 10:59:18 [scrapy.core.engine] INFO: Spider closed (finished)

In short, I'm still getting the 403 error with the scraper despite trying to scrape anonymously using Privoxy/Tor. Am I doing something wrong?


Solution

  • apkmirror.com is using Cloudflare to protect itself (among other things) against scraping and bots.

    Most probably they have Scrapy's default user agent blacklisted. So in addition to using a Tor IP (which, by the way, can also easily be blacklisted), you should also set a user agent header that looks like a real browser:

    In settings.py:

    USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0"
    

    (see https://doc.scrapy.org/en/latest/topics/settings.html#user-agent for details)
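
    Putting that together with the proxy setup from the question, the relevant part of settings.py would look roughly like this (just a sketch; the exact browser string above is only an example):

    import os
    os.environ['http_proxy'] = "http://localhost:8118"  # route requests through Privoxy -> Tor

    # Impersonate a regular desktop browser instead of Scrapy's default user agent
    USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0"

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
    }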