python sockets http raw-sockets python-sockets

Python Sockets Download jpg over HTTP

I am seeing very strange behavior from my Python script. I am using Python sockets to download an image from the web, and I am not interested in using requests/urllib. When I try to download the image, the download succeeds. However, when I open the file in the Photos app, Windows reports an "It looks like we don't support this file format" error.

This is where the strange part starts. If I copy and paste the URL that my socket is reaching out to (the one used to download the image, in this case http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg), download it myself from Chrome, and then run my script again, the image downloads and displays without a problem! The Content-Length value in the HTTP response headers also increases. I have done this 3 times with 3 different images and seen the same behavior each time. Below are two runs of my script, one before I downloaded the file from Chrome and one after. Notice that in the first run the Content-Length header states that there are 2564 bytes in the body of the response. In the second run, this number changes to 3833. Both runs request the same URL.

PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\3\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate


ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:58:24 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nAccept-Ranges: bytes\r\nLast-Modified: Sun, 12 Aug 2018 02:06:23 GMT\r\nX-Original-Content-Length: 25378\r\nX-Content-Type-Options: nosniff\r\nExpires: Sun, 12 Aug 2018 02:11:23 GMT\r\nCache-Control: max-age=300,private\r\nContent-Length: 2564\r\nConnection: close\r\nContent-Type: image/webp\r\n\r\nRIFF\xfc\t\...<hex data here>...\x00\x00'

RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:58:24 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
Accept-Ranges: bytes
Last-Modified: Sun, 12 Aug 2018 02:06:23 GMT
X-Original-Content-Length: 25378
X-Content-Type-Options: nosniff
Expires: Sun, 12 Aug 2018 02:11:23 GMT
Cache-Control: max-age=300,private
Content-Length: 2564
Connection: close
Content-Type: image/webp

IMAGE BINARY DATA SPLIT OFF
b'RIFF\xfc\t\...<hex data here>...\x00\x00'

Bytes in image data: 2581

PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\3\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate


ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:59:08 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nX-Content-Type-Options: nosniff\r\nAccept-Ranges: bytes\r\nExpires: Mon, 12 Aug 2019 04:58:50 GMT\r\nCache-Control: max-age=31536000\r\nEtag: W/"0"\r\nLast-Modified: Sun, 12 Aug 2018 04:58:50 GMT\r\nX-Original-Content-Length: 25378\r\nContent-Length: 3833\r\nConnection: close\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8\...<hex data here>...\xff\xd9'

RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:59:08 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
X-Content-Type-Options: nosniff
Accept-Ranges: bytes
Expires: Mon, 12 Aug 2019 04:58:50 GMT
Cache-Control: max-age=31536000
Etag: W/"0"
Last-Modified: Sun, 12 Aug 2018 04:58:50 GMT
X-Original-Content-Length: 25378
Content-Length: 3833
Connection: close
Content-Type: image/jpeg

IMAGE BINARY DATA SPLIT OFF
b'\xff\xd8\...<hex data here>...\xff\xd9'

Bytes in image data: 3850

Here is my code

import os
import socket
import sys

class MySocket:

    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock

    def connect(self, host, port):
        self.sock.connect((host, port))

    def myclose(self):
        self.sock.close()

    def mysend(self, msg, debug=False):
        if debug:
            print("MESSAGE SENT")
            print(msg.decode())
        self.sock.sendall(msg)

    def myreceive(self, debug=False):
        received = b''
        buffer = 1
        while True:
            part = self.sock.recv(buffer)
            received += part
            if part == b'':
                break
        if debug:
            print("Received...")
            print(received)
        return received

def download_image(img_url):
    """
    Download images with the given socket and list of urls
    :param img_url: url corresponding to an image
    :return: None
    """
    image_socket = MySocket()
    image_socket.connect("www.rit.edu", 80)
    message = "GET " + img_url + " HTTP/1.1\r\n" \
              "Host: www.rit.edu\r\n" \
              "Accept: image/webp,image/apng,image/*,*/*;q=0.8\r\n" \
              "Accept-Language: en-US,en;q=0.9\r\n" \
              "Accept-Encoding: gzip, deflate\r\n\r\n"
    image_socket.mysend(message.encode(), True)
    reply = image_socket.myreceive()
    print("ENTIRE MESSAGE RECEIVED")
    print(reply)
    print()
    headers = reply.split(b'\r\n\r\n')[0]

    print("RESPONSE HEADERS SPLIT OFF")
    print(headers.decode())
    image = reply[len(headers)+4:]
    print()

    print("IMAGE BINARY DATA SPLIT OFF")
    print(image)
    print()
    print("Bytes in image data:", sys.getsizeof(image))
    print()
    # print(type(image))
    img_name = str(len(os.listdir("D:\\Documents\\School\\RIT\\Classes\\Summer 2018\\CSEC 380\\Homework\\3\\Script\\act1step2images"))) + img_url[-4:]
    f = open(os.path.join("D:\\Documents\\School\\RIT\\Classes\\Summer 2018\\CSEC 380\\Homework\\3\\Script\\act1step2images", img_name), 'wb')
    f.write(image)
    f.close()

def main():
    download_image("http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg")

main()

Can anyone tell me what is going on and why the jpg does not download on the first try?

Solution

This is part of the request you sent:

Accept: image/webp,image/apng,image/*,*/*;q=0.8

It tells the server that image/webp is among the formats you accept at the highest preference, ahead of the */* fallback with q=0.8. The server is therefore free to answer with a WEBP image, and that is what you get in your response:

HTTP/1.1 200 OK
...
Content-Length: 2564
...
Content-Type: image/webp
...
b'RIFF\xfc\t\...<hex data here>...\x00\x00'

The next time you send the same request, you get a different response instead:

HTTP/1.1 200 OK
...
Content-Length: 3833
...
Content-Type: image/jpeg
...
b'\xff\xd8\...<hex data here>...\xff\xd9'

This time you get back a JPEG image instead of a WEBP image, as can be seen both in the Content-Type header and in the first bytes of the response body.
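The difference between the two bodies can also be checked in code. The following is a minimal sketch, not part of the original script: the function name sniff_image_format is made up for illustration, and it only looks at the well-known leading bytes of WEBP (a RIFF container), JPEG and PNG files.

```python
def sniff_image_format(body: bytes) -> str:
    """Guess the image format from the first bytes of an HTTP response body.

    Hypothetical helper for illustration; it recognizes only three formats.
    """
    # WEBP is a RIFF container: b"RIFF", a 4-byte size field, then b"WEBP".
    if body[:4] == b"RIFF" and body[8:12] == b"WEBP":
        return "webp"
    # JPEG data starts with the start-of-image marker FF D8, followed by FF.
    if body[:3] == b"\xff\xd8\xff":
        return "jpeg"
    # PNG data starts with a fixed 8-byte signature.
    if body[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    return "unknown"
```

Calling it on the `image` bytes before writing the file would make it obvious when the server sent something other than the JPEG implied by the `.jpg` file name.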

I'm not completely sure why this is the case, but I assume that the earlier request from Chrome made the server create the JPEG image from the original source file and cache it locally for later requests. It is then cheaper for the server to serve the pre-created JPEG file than to create a new WEBP file, and your Accept header stated that you support both formats.

Anyway, if your code does not support WEBP but only JPEG, then you should not claim to be able to deal with WEBP in your Accept header. Instead, you should claim only what you really support, i.e.

Accept: image/jpeg

The same is true of the other information you send in the request. For example, you claim to support compressed responses by sending Accept-Encoding: gzip, deflate, but your code cannot handle a compressed response. Similarly, by sending an HTTP/1.1 request you claim to be able to deal with chunked transfer encoding and HTTP keep-alive, but your code does not support either of these features.
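If you want to notice when a server takes you up on such a claim, you can inspect the response headers before saving the body. This is only a sketch under assumptions: parse_headers and unsupported_features are hypothetical names, and the checks cover just the three cases mentioned here (compressed body, chunked transfer encoding, and a Content-Length that does not match the bytes received).

```python
def parse_headers(raw_headers: bytes) -> dict:
    """Parse a raw header block (status line included) into a dict.

    Header names are lower-cased; the status line is skipped.
    """
    lines = raw_headers.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def unsupported_features(headers: dict, body: bytes) -> list:
    """Return a list of problems a naive read-until-close client cannot handle."""
    problems = []
    encoding = headers.get("content-encoding", "identity").lower()
    if encoding != "identity":
        problems.append("compressed body: " + encoding)
    if "chunked" in headers.get("transfer-encoding", "").lower():
        problems.append("chunked transfer encoding")
    length = headers.get("content-length")
    if length is not None and int(length) != len(body):
        problems.append(
            "Content-Length is %s but %d bytes were received" % (length, len(body))
        )
    return problems
```

Note that the length check uses len(body), which counts the bytes in the body; sys.getsizeof reports the size of the Python object including its overhead, which is why the numbers printed by the script are slightly larger than Content-Length.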

In summary, you should probably send only this request to get what you want:

GET /.... HTTP/1.0
Host: www.rit.edu
Accept: image/jpeg
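As a rough sketch of how that request could be sent with plain sockets (the helper names build_request, split_response and fetch are made up here, and the path is whatever you pass in; this is not a complete HTTP client):

```python
import socket


def build_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.0 GET request that only claims JPEG support."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Accept: image/jpeg",
        "",  # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(reply: bytes):
    """Split a raw response into (header block, body) at the first blank line."""
    head, sep, body = reply.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found in response")
    return head, body


def fetch(host: str, path: str, port: int = 80, timeout: float = 10.0):
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Reading until the connection closes is reasonable here because an HTTP/1.0 request without keep-alive asks for exactly one response per connection. Using partition rather than split also keeps any \r\n\r\n sequence that happens to occur inside the binary image data from being treated as a second separator.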