What's the suggested way of storing a resource ETag?


Where should I store the ETag for a given resource?

Approach A: compute on the fly

Get the resource and compute the ETag on the fly upon each request:

$resource = $repository->findByPK($id); // query

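// Compute the ETag from the last-modified timestamp
// (assumes getUpdatedAt() returns a DateTime, as setLastModified() expects)
$etag = md5($resource->getUpdatedAt()->format('U'));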

$response = new Response();
$response->setETag($etag);
$response->setLastModified($resource->getUpdatedAt());

if ($response->isNotModified($this->getRequest())) {
    return $response; // 304
}

Approach B: storing at database level

This saves a bit of CPU time per request, at the cost of slightly slower INSERT and UPDATE statements (we use triggers to keep the ETag up to date):

$resource = $repository->findByPK($id); // query

$response = new Response();
$response->setETag($resource->getETag());
$response->setLastModified($resource->getUpdatedAt());

if ($response->isNotModified($this->getRequest())) {
    return $response;
}

Approach C: caching the ETag

This is like approach B, but the ETag is stored in some cache middleware instead of the database.


Solution

  • I suppose it depends on how costly it is to obtain the items that go into the ETag itself.

    I mean, the user sends along a request for a given resource; this should trigger a retrieval operation on the database (or some other operation).

    If the retrieval is something simple such as fetching a file, then querying the file stats is fast, and there's no need to store anything anywhere: an MD5 of the file path plus its update time is enough.
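
    A minimal sketch of that file-based ETag follows. The helper name fileEtag() is hypothetical, and the "|" separator and the surrounding quotes (HTTP entity tags are quoted strings) are my own choices rather than anything required by the approach:

    ```php
    <?php
    // Hypothetical helper: derive an ETag from a file's path and modification time.
    function fileEtag(string $path): string
    {
        clearstatcache(true, $path);   // avoid a stale mtime from PHP's stat cache
        $mtime = is_file($path) ? filemtime($path) : false;
        if ($mtime === false) {
            throw new RuntimeException("Cannot stat $path");
        }
        // Nothing is stored anywhere: path + mtime fully determine the tag.
        return '"' . md5($path . '|' . $mtime) . '"';
    }
    ```

    It could then be sent with something like header('ETag: ' . fileEtag($path)); the tag changes whenever the file is rewritten, because its modification time changes.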

    If the retrieval implies querying a database, then it depends on whether you can decompose the query without losing performance. For example, the user requests an article by ID, and you retrieve only the relevant data from the article table. A cache "hit" then costs a single SELECT on a primary key, but a cache "miss" means you have to query the database again, and the first query is wasted (or not, depending on your model).
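
    The decomposition could look roughly like the sketch below. The article table, its updated_at and body columns, and all three function names are assumptions made for illustration. The If-None-Match comparison is deliberately simplified to an exact match (no weak validators, no comma-separated lists):

    ```php
    <?php
    // Hypothetical schema: article(id INTEGER PRIMARY KEY, updated_at TEXT, body TEXT).

    // Step 1: cheap lookup that reads one column through the primary key.
    function articleEtag(PDO $pdo, int $id): ?string
    {
        $stmt = $pdo->prepare('SELECT updated_at FROM article WHERE id = ?');
        $stmt->execute([$id]);
        $updatedAt = $stmt->fetchColumn();
        return $updatedAt === false ? null : '"' . md5($id . '|' . $updatedAt) . '"';
    }

    // Step 2: the heavier query, needed only when the client's copy is stale.
    function articleBody(PDO $pdo, int $id): ?string
    {
        $stmt = $pdo->prepare('SELECT body FROM article WHERE id = ?');
        $stmt->execute([$id]);
        $body = $stmt->fetchColumn();
        return $body === false ? null : $body;
    }

    // Returns [status, body]; $ifNoneMatch is the raw If-None-Match header or null.
    function serveArticle(PDO $pdo, int $id, ?string $ifNoneMatch): array
    {
        $etag = articleEtag($pdo, $id);
        if ($etag === null) {
            return [404, null];
        }
        if ($ifNoneMatch === $etag) {
            return [304, null];                   // hit: one SELECT in total
        }
        return [200, articleBody($pdo, $id)];     // miss: a second SELECT
    }
    ```

    Whether the second SELECT on a miss is an acceptable price is exactly the judgment call described above.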

    If the query (or sequence of queries) decomposes well (and the resulting code stays maintainable), then I'd go with the dynamic ETag again.

    If it does not, then it mostly depends on the query cost and the overall maintenance cost of a stored-ETag solution. If the query is costly (or the output is bulky) and INSERTs/UPDATEs are few, then (and, I think, only then) it will be advantageous to store the ETag in a secondary column (or table).
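
    As a rough sketch of the stored-ETag variant, the ETag can be written in the same statement as the data, so the two never drift apart. The report table, its columns, and both function names are hypothetical, and this does the work in application code rather than in a database trigger:

    ```php
    <?php
    // Hypothetical schema: report(id INTEGER PRIMARY KEY, payload TEXT, etag TEXT).

    // Write path: pay for the hash once per UPDATE instead of once per request.
    function updateReport(PDO $pdo, int $id, string $payload): string
    {
        $etag = '"' . md5($payload) . '"';
        $stmt = $pdo->prepare('UPDATE report SET payload = ?, etag = ? WHERE id = ?');
        $stmt->execute([$payload, $etag, $id]);
        return $etag;
    }

    // Read path: fetch only the small etag column, never the bulky payload.
    function storedEtag(PDO $pdo, int $id): ?string
    {
        $stmt = $pdo->prepare('SELECT etag FROM report WHERE id = ?');
        $stmt->execute([$id]);
        $etag = $stmt->fetchColumn();
        return $etag === false ? null : $etag;
    }
    ```

    The maintenance cost mentioned above shows up here as a rule: every write to payload has to go through updateReport() (or an equivalent trigger), otherwise the stored ETag goes stale.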

    As for the caching middleware, I don't know. If I had a framework keeping track of everything for me, I might say 'go for it': the middleware is supposed to take care of, and implement, the points above. Should the middleware be implementation-agnostic (unlikely, unless it's a cut-and-paste slap-on ... which is not unheard of), then there would be either the risk of it "screening" (hiding) updates to the resource, or an excessive awkwardness in invoking some cache-clearing API upon updates. Both factors would need to be evaluated against the load improvement offered by ETag support.
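
    To make the two risks concrete, here is a toy sketch of an ETag cache. EtagCache is a hypothetical in-process stand-in for whatever real cache would be used, not the API of any particular middleware:

    ```php
    <?php
    // Hypothetical in-process stand-in for a real cache backend.
    final class EtagCache
    {
        /** @var array<string, string> */
        private array $etags = [];

        // Return the cached ETag, computing it via $compute only on a miss.
        public function get(string $key, callable $compute): string
        {
            if (!isset($this->etags[$key])) {
                $this->etags[$key] = $compute();
            }
            return $this->etags[$key];
        }

        // Has to be called from every code path that changes the resource;
        // a forgotten call is how the cache ends up hiding an update.
        public function invalidate(string $key): void
        {
            unset($this->etags[$key]);
        }
    }
    ```

    The invalidate() call is the "cache-clearing API" in question: cheap to write once, but it has to be wired into every update path for the cached ETag to stay trustworthy.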

    I don't think that in this case a 'silver bullet' exists.

    Edit: in your case there is little - or even no - difference between cases A and B. To be able to implement getUpdatedAt(), you would need to store the update time in the model.

    In this specific case I think that the dynamic, explicit calculation of the ETag (case A) would be simpler and more maintainable. The retrieval cost is incurred in any case, and the explicit calculation costs one MD5 computation, which is really fast and completely CPU-bound. The advantages in maintainability and simplicity are, in my opinion, overwhelming.

    On a semi-related note, it occurs to me that in some cases (infrequent updates to the database and much more frequent queries against it) it might be advantageous, and almost transparent, to implement a global Last-Modified time for the whole database. If the database has not changed, then no query against it can return a changed resource, whatever the query is. In such a situation, one would only need to store the global Last-Modified timestamp somewhere easy and quick to retrieve (not necessarily the database). For example, one could call

    function dbModified() {
        touch('.last-update'); // creates the file, or updates its modification time
    }
    

    in any INSERT/UPDATE/DELETE code. The resource would then add a header

    function sendModified() {
        $tsstring = gmdate('D, d M Y H:i:s ', filemtime('.last-update')) . 'GMT';
        header('Last-Modified: ' . $tsstring);
    }
    

    to inform the browser of that resource's modification time.

    Then, any request for a resource that includes If-Modified-Since could be bounced back with a 304 without ever accessing the persistence layer (or at least without loading the persisted resource itself). No record-level update time would be needed:

    function ifNotModified() {
        // Check out timezone settings. The GMT helps but it's not always the ticket
        $ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
            ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE'])
            : false; // No conditional header: the test below must FAIL

        // false if dbModified() has never run and the marker file does not exist
        $lastUpdate = is_file('.last-update') ? filemtime('.last-update') : false;

        if ($ims !== false && $lastUpdate !== false && $lastUpdate <= $ims) {
            // The database was never updated after the resource retrieval.
            // There's no way the resource may have changed.
            header('HTTP/1.1 304 Not Modified');
            exit;
        }
    }
    

    One would put the ifNotModified() call as early as possible in the resource supply route, the sendModified() call as early as possible in the resource output code, and the dbModified() call wherever the database gets modified in a way that matters to the resources (i.e., you can and probably should skip it when logging access statistics to the database, as long as they do not influence the resources' content).
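
    The call order described above can be sketched as a single handler. This is a testable variant under my own assumptions: handle() is a hypothetical name, $marker plays the role of '.last-update', $loadResource stands for the expensive database work, and the function returns its result instead of calling header() and exit:

    ```php
    <?php
    // Hypothetical request handler; returns [status, headers, body].
    function handle(string $marker, ?string $ifModifiedSince, callable $loadResource): array
    {
        clearstatcache(true, $marker);  // do not trust a cached mtime
        $lastUpdate = is_file($marker) ? filemtime($marker) : false;
        $ims = $ifModifiedSince === null ? false : strtotime($ifModifiedSince);

        // 1. As early as possible: answer 304 before touching the database.
        if ($lastUpdate !== false && $ims !== false && $lastUpdate <= $ims) {
            return [304, [], null];
        }

        // 2. Only now pay for the resource, and advertise the global timestamp.
        $headers = [];
        if ($lastUpdate !== false) {
            $headers['Last-Modified'] = gmdate('D, d M Y H:i:s', $lastUpdate) . ' GMT';
        }
        return [200, $headers, $loadResource()];
    }
    ```

    On the 304 path $loadResource is never invoked, which is the whole point: a conditional request costs one file stat and nothing else.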