python linux python-2.7 twisted http-proxy

Can somebody explain what this piece of twisted does?


    class ConnectProxy(Proxy):

        requestFactory = ConnectProxyRequest
        connectedRemote = None

        def requestDone(self, request):
            if request.method == 'CONNECT' and self.connectedRemote is not None:
                self.connectedRemote.connectedClient = self
            else:
                Proxy.requestDone(self, request)

What does self.connectedRemote.connectedClient = self do?


Solution

  • The original source of this code snippet is one of my GitHub repositories, which wraps the Twisted HTTP proxy server protocol to also support the CONNECT method.

    The short answer is that this assigns self (the downstream protocol instance -- between the client and this twisted server) to a member of the upstream connection's protocol instance (to the remote https server), so that any data received from the upstream connection can be easily written back to the downstream client's transport.

    I have a longer explanation of the code, but it will help to have a basic understanding of the HTTP proxy protocol, so I'll try to set some context, but feel free to skip ahead if you already know how that works.

    A Brief Overview of HTTP Proxy GET and CONNECT

    In addition to being a request/response transport for communicating with web servers, the HTTP protocol can also be used to communicate through a "forward" proxy - to other servers - via some minor extensions to the protocol.

    A normal HTTP request is made by establishing a TCP connection directly to a server, for example www.example.com:80. The HTTP request looks something like this:

    > GET /foo HTTP/1.1
    < HTTP/1.1 200 OK
    

    If you need to talk through an HTTP proxy, you instead make a TCP connection to that proxy server (say, localhost:8080) and send it the following specially formatted HTTP request, with a full URL in place of the path:

    > GET http://www.example.com/foo HTTP/1.1
    < HTTP/1.1 200 OK
    

    The proxy will do the DNS lookup for www.example.com, establish a TCP connection, and send the HTTP request, all on your behalf. It will then stream the response back to your client. Configuring your browser to use a specific proxy will cause it to implicitly rewrite all HTTP URLs to go through that proxy server.
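    As a rough illustration of the first step a proxy has to perform, here is a minimal standard-library sketch that splits a proxy-style request line into the host and port to connect to and the path to forward. The function name parse_proxy_request_line is made up for this example; it is not part of Twisted or of twisted-connect-proxy, and it ignores headers and error handling.

```python
from urllib.parse import urlsplit


def parse_proxy_request_line(line):
    """Split a proxy-style request line into the pieces a proxy needs.

    Hypothetical helper for illustration only; not a Twisted API.
    """
    method, target, version = line.split()
    parts = urlsplit(target)
    # Plain HTTP defaults to port 80 when the URL does not name a port.
    port = parts.port or 80
    # The origin server only expects the path (plus any query string).
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return method, parts.hostname, port, path, version


print(parse_proxy_request_line("GET http://www.example.com/foo HTTP/1.1"))
```

    The proxy would then resolve the returned host, connect to the returned port, and send a request for just the path.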

    To tie this back to the question a bit: Twisted ships a Proxy protocol that understands these HTTP proxy GET requests. This works well for regular browsers talking plain HTTP.

    Now, with TLS in the picture, we want to stop the proxy from sniffing our traffic that is supposed to be secured. This is why browsers, instead of sending an https request via the proxy like this:

    > GET https://www.example.com/foo HTTP/1.1
    < HTTP/1.1 200 OK
    

    use the CONNECT HTTP proxy extension method. This method, if the proxy supports it, puts the connection into a "tunnel" or pass-through mode. These requests look something like this:

    > CONNECT www.example.com:443 HTTP/1.1
    < HTTP/1.1 200 CONNECT OK
    

    In this mode, instead of handling all of the dance of DNS, TCP, (TLS), and HTTP on behalf of the client, the proxy only does the first two, DNS and TCP. If the proxy successfully establishes a TCP connection to the example.com IP, it will then start forwarding all bytes sent by the client to the server, and all bytes received from the server back to the client, as-is. This allows the client and server to perform the subsequent TLS handshake and HTTP request without realizing (or caring) that there is a proxy server in-between.
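    To make the difference from a proxy GET concrete, here is a small standard-library sketch of what a proxy needs from a CONNECT request: only the host and port, after which bytes are forwarded untouched. The names parse_connect_target and relay are hypothetical helpers for this example and do not come from Twisted; real tunnels forward in both directions over sockets, which this sketch replaces with plain lists.

```python
def parse_connect_target(line):
    """Return (host, port) from a CONNECT request line.

    Hypothetical helper for illustration only; not a Twisted API.
    """
    method, target, version = line.split()
    if method != "CONNECT":
        raise ValueError("not a CONNECT request: %r" % line)
    # The target is "host:port", not a URL, so split on the last colon.
    host, _, port = target.rpartition(":")
    return host, int(port)


def relay(chunks, write):
    """Forward each chunk unchanged; the proxy never inspects the payload."""
    for chunk in chunks:
        write(chunk)


host, port = parse_connect_target("CONNECT www.example.com:443 HTTP/1.1")
print(host, port)

# Stand-in for the server side of the tunnel: a list that collects bytes.
sent_to_server = []
relay([b"\x16\x03\x01", b"opaque TLS bytes"], sent_to_server.append)
print(sent_to_server)
```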

    As-is, the twisted.web.proxy module does not implement the CONNECT method. It will instead return an HTTP 501 Not Implemented error for requests with this method, causing any connected browsers to fail to load assets over https.

    twisted-connect-proxy tries to fill this gap by subclassing the existing classes in twisted.web.proxy and adding support for the CONNECT method.

    A Description of the Code

    Proxies can be a bit hairy to reason about, but the trick is to keep in mind that for every request there are two connections in play: the "downstream" connection between the client (e.g., browser) and the twisted web proxy server, and the "upstream" connection from the twisted web proxy server to the remote HTTP(S) server.

    Here is the code in question again:

    class ConnectProxy(Proxy):
    
        requestFactory = ConnectProxyRequest
        connectedRemote = None
    
        def requestDone(self, request):   
            if request.method == 'CONNECT' and self.connectedRemote is not None:  
                self.connectedRemote.connectedClient = self
            else:
                Proxy.requestDone(self, request)
    

    We'll go section by section:

    class ConnectProxy(Proxy):
    

    Proxy here is twisted.web.proxy.Proxy, a twisted.internet.protocol.Protocol subclass that implements an HTTP request handler for HTTP servers. We subclass it into a new Protocol, called ConnectProxy.

    requestFactory = ConnectProxyRequest
    connectedRemote = None
    

    The parent class, twisted.web.proxy.Proxy, is a really basic subclass of the HTTP request handler protocol, twisted.web.http.HTTPChannel. It normally defines the requestFactory attribute as the twisted.web.proxy.ProxyRequest class. But since that request handler doesn't support the CONNECT method, we use our own subclass defined later in the file, called ConnectProxyRequest. A better name might have been ConnectOrGetProxyRequest, but oh well.

    connectedRemote = None is a class attribute, which is just a shorthand way of making self.connectedRemote default to None on every instance of this server protocol. For CONNECT requests made against this protocol instance, this attribute will be assigned an instance of the ConnectProxyClient Protocol (defined later in the source) on a successful upstream connection. ConnectProxyClient is responsible for forwarding the data it receives in dataReceived from the upstream connection (with the remote server) back to the downstream client that connected to this server, via this protocol instance.
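    The class-attribute default is plain Python behaviour and can be seen without Twisted. In this sketch, Channel is a made-up stand-in class and the assigned string is a placeholder for the upstream protocol object.

```python
class Channel(object):
    # Class attribute: the default seen through every instance.
    connectedRemote = None


a = Channel()
b = Channel()

# Assigning through an instance creates an attribute on that instance only;
# the class-level default and other instances are unaffected.
a.connectedRemote = "upstream protocol for a"

print(a.connectedRemote)
print(b.connectedRemote)
print(Channel.connectedRemote)
```

    This is why each downstream connection can hold its own upstream reference even though the default is declared once on the class.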

    def requestDone(self, request):   
    

    requestDone is called by the guts of HTTPChannel once the entire HTTP response (headers and body) from this server has been sent to the downstream client. For GET requests, this means the response from the upstream HTTP server has been fully relayed. For CONNECT requests, it means the upstream TCP connection attempt has either succeeded or failed.

    if request.method == 'CONNECT' and self.connectedRemote is not None:  
        self.connectedRemote.connectedClient = self
    

    Per the HTTP specification (RFC 7231, section 4.3.6), once the (proxy) server has sent its 2xx success response to a CONNECT, the connection switches to pass-through (tunnel) mode.

    On a successful upstream (to the remote server) connection, self.connectedRemote will be set to the pass-through protocol for that connection, instead of None. In that case, we give it a reference to this protocol so that it can easily send its data back to the downstream client (e.g., browser).

    If the upstream connection failed, or this was a GET proxy request that we finished streaming the response body for, we fall through to the else: case:

    else:
        Proxy.requestDone(self, request)
    

    This calls the parent class's requestDone. This is important so that the HTTP server correctly closes or maintains persistent "keepalive" connections with the downstream client.
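    To see the linking in action without Twisted, here is a toy model of the two protocol objects. Everything in it is a simplified stand-in: FakeTransport, ToyDownstream and ToyUpstream are invented for this sketch, requestDone takes a method string rather than a request object, and the way ToyUpstream.dataReceived forwards data is an assumption about what the real ConnectProxyClient does, not a copy of it.

```python
class FakeTransport(object):
    """Records written bytes instead of sending them over a socket."""

    def __init__(self):
        self.written = []

    def write(self, data):
        self.written.append(data)


class ToyUpstream(object):
    """Stand-in for the upstream protocol (ConnectProxyClient in the source)."""

    connectedClient = None

    def dataReceived(self, data):
        # Assumed behaviour: forward server bytes to the browser once linked.
        if self.connectedClient is not None:
            self.connectedClient.transport.write(data)


class ToyDownstream(object):
    """Stand-in for the downstream protocol (ConnectProxy in the source)."""

    connectedRemote = None

    def __init__(self):
        self.transport = FakeTransport()
        self.parentCalled = False

    def requestDone(self, method):
        if method == "CONNECT" and self.connectedRemote is not None:
            # Hand the upstream side a reference back to this connection.
            self.connectedRemote.connectedClient = self
        else:
            # Stands in for Proxy.requestDone(self, request).
            self.parentCalled = True


browser_side = ToyDownstream()
server_side = ToyUpstream()
browser_side.connectedRemote = server_side  # upstream TCP connect succeeded
browser_side.requestDone("CONNECT")
server_side.dataReceived(b"tls bytes")
print(browser_side.transport.written)
```

    A GET request, or a CONNECT whose upstream connection never succeeded, leaves connectedRemote as None and takes the else branch instead.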