
Instruct Scrapy not to add a `Content-Length` header automatically if one already exists


I have a situation where a website fingerprints requests based on header order and casing.

I've been able to specify the header order with the correct casing as follows:

import json

from scrapy.http import Request
from scrapy.spiders import Spider
from twisted.web.http_headers import Headers as TwistedHeaders

class Test(Spider):
    name = 'test'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'aA': 'a',
            'Bb': 'b',
            'CC': 'c',
            'Content-Length': '14',
            'dD': 'd',
        },
    }
    
    # Preserve casing of headers (note: _caseMappings is a private Twisted attribute)
    TwistedHeaders._caseMappings[b'aa'] = b'aA'
    TwistedHeaders._caseMappings[b'bb'] = b'Bb'
    TwistedHeaders._caseMappings[b'cc'] = b'CC'
    TwistedHeaders._caseMappings[b'dd'] = b'dD'

    def start_requests(self):
        yield Request(
            'https://httpbin.org/post',
            body=json.dumps({'foo': 'bar'}),
            method='POST',
            # Sniff with Fiddler
            # meta={'proxy': 'https://127.0.0.1:8866'}
        )
    
    def parse(self, response):
        # The response is ignored; only the outgoing request headers matter here.
        pass

When I run the spider, I notice in Fiddler that a second Content-Length header is present at the start of the request headers:

[Screenshot: Fiddler inspection of the request]

I've tried to find where in Scrapy/Twisted this is being set, but as I am pretty new it is a lot to read through. As a result, I am having a hard time understanding why this is happening.

Is there any way to instruct Scrapy not to add Content-Length automatically if it's already present? Or, if it has to be added automatically, to make Content-Length respect the header order?

I know that if I remove Content-Length, the request works; however, it is still unordered (Content-Length occurs as the first key in the headers). For my use case, I think Content-Length must occur in the right spot. For the case of this example, that's between CC and dD.
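
To make the target layout concrete, here is a minimal sketch of the headers I am aiming for, with Content-Length derived from the body instead of hard-coded. `build_ordered_headers` is a hypothetical helper, not part of Scrapy; it only builds the dict and says nothing about the order Scrapy/Twisted finally writes on the wire.

```python
import json


def build_ordered_headers(body: str) -> dict:
    """Return the example headers in the desired order (hypothetical helper).

    Content-Length is derived from the UTF-8 encoded body rather than typed by hand.
    """
    content_length = len(body.encode("utf-8"))
    # Python dicts keep insertion order, so the keys stay in the order written here.
    return {
        "aA": "a",
        "Bb": "b",
        "CC": "c",
        "Content-Length": str(content_length),
        "dD": "d",
    }


body = json.dumps({"foo": "bar"})
headers = build_ordered_headers(body)
print(list(headers))
print(headers["Content-Length"])
```

The resulting dict could be passed as `DEFAULT_REQUEST_HEADERS` in place of the literal one above, so the length stays correct if the body changes.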

I would appreciate any steps in the right direction. Thank you!


Solution

  • I was able to sort the Scrapy headers alphabetically and make them case sensitive (including Content-Length) by:

    1. ORDER: Creating a custom downloader that writes the headers sorted alphabetically
    2. CASE SENSITIVE: Modifying _caseMappings of the internal Twisted Headers class to allow case-sensitive headers
    3. TWO "Content-Length" HEADERS: Modifying the _writeToBodyProducerContentLength method in Twisted's web/_newclient.py (found here) as follows, so that Twisted no longer writes its own Content-Length line:

        def _writeToBodyProducerContentLength(self, transport):
        -    self._writeHeaders(
        -        transport,
        -        networkString("Content-Length: %d\r\n" % (self.bodyProducer.length,)),
        -    )
        +    self._writeHeaders(transport, None)
    

    The code in my GitHub repository can be found here
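
The effect of the three steps can be illustrated with a simplified, dependency-free model. This is a sketch only, not Twisted's or Scrapy's actual code: `serialize_request_head` and `header_names` are hypothetical names, and the `auto_content_length` flag stands in for the patched versus unpatched `_writeToBodyProducerContentLength`.

```python
def serialize_request_head(method, path, headers, body, auto_content_length):
    """Simplified model of a client writing an HTTP/1.1 request head.

    `headers` is a list of (name, value) pairs. NOT Twisted's implementation;
    it only mimics the behaviour described in the steps above.
    """
    lines = [f"{method} {path} HTTP/1.1"]
    if auto_content_length:
        # Unpatched behaviour: the client writes its own Content-Length first.
        lines.append(f"Content-Length: {len(body)}")
    # Step 1: sort case-insensitively. Step 2: write names exactly as given.
    for name, value in sorted(headers, key=lambda pair: pair[0].lower()):
        lines.append(f"{name}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n"


def header_names(head):
    """Return the header names of a serialized head, in wire order."""
    return [line.split(":", 1)[0] for line in head.split("\r\n")[1:] if line]


body = b'{"foo": "bar"}'
# Deliberately out of order to show the effect of the sort.
headers = [
    ("dD", "d"),
    ("Content-Length", str(len(body))),
    ("CC", "c"),
    ("Bb", "b"),
    ("aA", "a"),
]

stock = serialize_request_head("POST", "/post", headers, body, True)
patched = serialize_request_head("POST", "/post", headers, body, False)

print(header_names(stock))
print(header_names(patched))
```

In this model the unpatched path emits Content-Length twice, with the automatic one first, while the patched path emits it once, between CC and dD. Note that an alphabetical sort happens to produce the order the question asks for; a site expecting a different order would need a different sort key.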