Search code examples
cachingcdnsquidamazon-cloudfrontproxy-server

How to get started with web caching, CDNs, and proxy servers?


I'm newbie programmer building a startup that I (naturally) hope will create a large amount of traffic. I am hosting my django project on dotcloud, which is on Amazon EC2. I have some streaming media (Http though, not rmtp) so the dotcloud guys recommended I go with a CDN. I am also using Amazon S3 for storage and so decided to go with Amazon CloudFront as my CDN.

The time has come where I need to turn my attention to caching and I am lost and confused. I am completely new to the concept. The entire extent of my knowledge comes from a tutorial I just read (http://www.mnot.net/cache_docs/) and a confusing weekend spent consulting google. Most troubling of all is that I am not even sure what I need to do for my site.

  1. What is the difference between a CDN and a proxy server?

  2. Is it possible I might want to use a caching service (e.g. memcached, redis), a CDN (CloudFront), AND a proxy server (squid)?

  3. Our site is DB driven and produces dynamically generated lists specific to user locations. Can such a site be cached? (The lists themselves are filterable via AJAX, so the URL might remain the same while producing largely different results. For instance, example.com/some_url/ might generate a list of 40 objects, but only 10 appearing on the page. By clicking on a filter, the user could end up with 10 different objects while still at /some_url/)

  4. What are the best practices for a high traffic, rich content site?

  5. How can I learn about this? Everywhere I look seems to take for granted some basics that I just don't have as a part of my own foundation yet.

I'm not certain I'm asking the right questions. Just feeling very lost. I've now built 95% of my entire site and thought I was just ironing out the details but caching seems like another major undertaking. Any guidance/advice/encouragement would be much appreciated!


Solution

  • Right then let's start with caching...

    Caching is about storing something on a temporary basis so that you don't have to perform a more expensive operation to retrieve it every time.

    HTTP caching is about saving round-trips to servers, if you just use default behaviour a browser will ask the server to "send me a copy of this resource if you have a more recent version"

    If you set expires header to a future time, then the browser doesn't ask this question as it knows it can use the copy of the resource it's got.

    Caching at this level improves the end-users experience and saves you bandwidth.

    From your brief description HTTP caching could help with the smaller static files (have a read of ch3 of bookofspeed.com)

    DB caching as memcached (and redis) are used for are about reducing the load on databases (for example) by saving the results on an operation and then serving them from the cache rather than repeating the database operation)

    In your situation you would cache at the data retrieval layer based on the request parameters (and perhaps ensure the HTTP responses to the client aren't cached).

    CDNs vs Proxy Servers...

    These are really different beasts - CDNs are about keeping content close to your visitors so reducing latency - if you're serving large files it also puts them on a network optimised for it instead of your servers but there's a £££ price attached to doing that. Some CDNs e.g. cloud front have a proxy like behaviour where they go back to your origin server if they don't have the file the visitor wants.

    Proxy servers are literally servers that sit between your server and the end visitor - they might be part of your server farm (reverse proxy) the ISP's network or the visitor's network.

    A reverse proxy is essentially offloading the work of communication with the end-visitor from your servers e.g. if they have a slow connection they'll tie up a server generating a page for longer. Reverse proxies can also sit infront of multiple servers - either all doing the same thing or different things and the proxy presents a single address to the outside world. Squid is one proxy you might use but Varnish is very popular ATM too.

    Normal proxies just act as caches for those visitors who come through them e.g. a company may have a caching proxy server at their internet gateway so that the first person visiting an external site gets to retrieve a file and subsequent visitors get it form the proxy - they get a faster experience and the company reduces their bandwidth consumption.

    I'm guessing you don't have a high traffic site at the moment so your challenge is to understand where to spend your effort i.e. what needs optimising when.

    My first recommendation would be to get some real user monitoring (RUM) in, even if it's building your own using Boomerang.js or Pion. Also look at monitoring tools such as Cacti/Munin/CollectD so you can understand the load on your servers.

    Understanding your users experience is key to working out where you need to optimise.