Tags: python, scrapy, character-encoding, web-crawler, scrapyd

Scrapyd corrupting response?


I'm trying to scrape a specific website. The code I'm using is the same code that successfully scrapes many other sites.

However, the resulting response.body looks completely corrupt (segment below):

����)/A���(��Ե�e�)k�Gl�*�EI�
                             ����:gh��x@����y�F$F�_��%+�\��r1��ND~l""�54بN�:�FA��W
b� �\�F�M��C�o.�7z�Tz|~΢0��̔HgA�\���[��������:*i�P��Jpdh�v�01]�Ӟ_e�b߇��,�X��E, ��냬�e��Ϣ�5�Ϭ�B<p�A��~�3t3'>N=`

And as a result it is impossible to parse.

What is really confusing is that if I run scrapy shell on the same URL, everything works fine (the website's charset is UTF-8), which leads me to believe the problem is caused by scrapyd.

I'd really appreciate any suggestions.
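For reference, one quick way to tell whether a body like the one above is merely still compressed (rather than truly corrupted) is to inspect its first bytes. The `sniff_compression` helper below is a hypothetical diagnostic, not part of Scrapy; gzip streams start with the magic bytes `0x1f 0x8b`, and zlib-wrapped deflate streams typically start with `0x78`:

```python
import gzip

def sniff_compression(body: bytes) -> str:
    """Guess whether a raw response body is still compressed."""
    if body[:2] == b"\x1f\x8b":
        return "gzip"
    if body[:1] == b"\x78":
        return "zlib/deflate"
    return "unknown/none"

# A gzip-compressed payload is detected and can be decoded manually:
raw = gzip.compress(b"<html>hello</html>")
print(sniff_compression(raw))          # gzip
print(gzip.decompress(raw))            # b'<html>hello</html>'
```

If the sniffer reports gzip or deflate, the download itself succeeded and the problem is only that nothing decompressed the body.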

settings.py

# -*- coding: utf-8 -*-

BOT_NAME = "[name]"

SPIDER_MODULES = ["[name].spiders"]
NEWSPIDER_MODULE = "[name].spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = '[name] (+http://www.yourdomain.com)'

ROBOTSTXT_OBEY = False

CRAWLERA_MAX_CONCURRENT = 50
CONCURRENT_REQUESTS = CRAWLERA_MAX_CONCURRENT
CONCURRENT_REQUESTS_PER_DOMAIN = CRAWLERA_MAX_CONCURRENT

AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
DUPEFILTER_DEBUG = True

COOKIES_ENABLED = False  # Disable cookies (enabled by default)

DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
    "accept-encoding": "gzip, deflate, br",
}

DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "KEY"

ITEM_PIPELINES = {
    "[name].pipelines.Export": 400,
}
# sentry dsn
SENTRY_DSN = "Key"

EXTENSIONS = {
    "[name].extensions.SentryLogging": -1,  # Load SentryLogging extension before others
}

Solution

  • Thanks to Serhii's suggestion, I found that the issue was caused by "accept-encoding": "gzip, deflate, br": I was requesting compressed responses but never decompressing them in Scrapy.

    Enabling scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware, or simply removing the accept-encoding header, fixes the issue.
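A minimal sketch of the two fixes in settings.py, assuming the project settings shown above (the priority 590 matches HttpCompressionMiddleware's default position in Scrapy's built-in middleware ordering):

```python
# Option 1: list Scrapy's decompression middleware explicitly so it runs
# alongside the custom middlewares (it is enabled by default, but spelling
# it out guards against it being disabled by other configuration).
DOWNLOADER_MIDDLEWARES = {
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 200,
    "scrapy_crawlera.CrawleraMiddleware": 300,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
}

# Option 2: stop advertising compression support, so the server sends
# uncompressed responses in the first place.
DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
    # "accept-encoding" line removed
}
```

Note that decoding br (Brotli) responses requires an extra package such as brotli to be installed; without it, even the compression middleware cannot handle that encoding, so dropping br from accept-encoding may be the safer choice.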