I'm trying to scrape a specific website. The code I'm using is the same code that successfully scrapes many other sites. However, the resulting `response.body` looks completely corrupt (segment below):
```
����)/A���(��Ե�e�)k�Gl�*�EI�
����:gh��x@����y�F$F�_��%+�\��r1��ND~l""�54بN�:�FA��W
b� �\�F�M��C�o.�7z�Tz|~0��̔HgA�\���[��������:*i�P��Jpdh�v�01]�Ӟ_e�b߇��,�X��E, ��냬�e��Ϣ�5�Ϭ�B<p�A��~�3t3'>N=`
```
As a result, it is impossible to parse.
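A quick way to tell whether bytes like these are an undecoded compressed stream rather than a charset problem is to check for the gzip magic number. A minimal sketch, with `parse` standing in for my real callback (gzip streams always start with `0x1f 0x8b`; brotli has no such prefix, so this only rules gzip in or out):

```
def parse(self, response):
    # Gzip streams begin with the magic bytes 0x1f 0x8b; if they are still
    # present, the downloader never decompressed the response body.
    if response.body[:2] == b"\x1f\x8b":
        self.logger.warning("Body still gzip-compressed: %s", response.url)
        return
    # ...normal parsing continues here...
```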
What is really confusing is that if I run `scrapy shell` on the same URL, everything works fine (the website's charset is utf-8), which leads me to believe the problem is caused by scrapyd.
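For reference, this is roughly the check I ran in the shell (the URL is a placeholder):

```
$ scrapy shell 'https://example.com/some-page'
>>> response.headers.get('Content-Encoding')  # what the server sent back
>>> response.text[:200]                       # decodes cleanly here
```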
I'd really appreciate any suggestions. Here is my settings.py:
```
# -*- coding: utf-8 -*-

BOT_NAME = "[name]"

SPIDER_MODULES = ["[name].spiders"]
NEWSPIDER_MODULE = "[name].spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = '[name] (+http://www.yourdomain.com)'

ROBOTSTXT_OBEY = False

CRAWLERA_MAX_CONCURRENT = 50
CONCURRENT_REQUESTS = CRAWLERA_MAX_CONCURRENT
CONCURRENT_REQUESTS_PER_DOMAIN = CRAWLERA_MAX_CONCURRENT
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
DUPEFILTER_DEBUG = True

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
    "accept-encoding": "gzip, deflate, br",
}

DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "KEY"

ITEM_PIPELINES = {
    "[name].pipelines.Export": 400,
}

# sentry dsn
SENTRY_DSN = "Key"

EXTENSIONS = {
    # Load SentryLogging extension before others
    "[name].extensions.SentryLogging": -1,
}
```
Thanks to Serhii's suggestion, I found that the issue was the `"accept-encoding": "gzip, deflate, br"` header: I was requesting compressed responses but never decompressing them in Scrapy. Adding `scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware` to `DOWNLOADER_MIDDLEWARES`, or simply removing the `accept-encoding` line, fixes the issue.
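For anyone hitting the same thing, here's a minimal sketch of both fixes (the 590 priority mirrors Scrapy's default slot for this middleware, as far as I can tell; decoding `br` additionally needs the `brotli` package installed):

```
DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
    # Explicitly enable Scrapy's built-in decompression so gzip/deflate (and,
    # with the brotli package, br) bodies are decoded before the spider sees them.
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
}

# ...or drop the header and let Scrapy negotiate encodings it can decode:
DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
}
```

(`DOWNLOADER_MIDDLEWARES` is merged with Scrapy's defaults, so listing the middleware explicitly mainly guards against it being disabled elsewhere.)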