apache http caching etag

Implementing HTTP ETags in a web server


I'm looking into implementing ETags in a web server, to support conditional GET only. The web server is written in C++ and runs only on Windows. After doing some research, I have a few questions. Do servers that implement this feature generally cache the ETag GUID for a particular file? I'm not too familiar with the Apache code base; I was able to locate the ap_condition_if_none_match function, but it isn't entirely clear to me how it checks the GUID value against the If-None-Match header. If servers do cache ETags and a file changes without the server being involved (i.e., a user updated it), how would the server know that the file in its cache is no longer valid? Do they perhaps use some API to "watch" for directory changes?

Edit: I am reviewing some info I found here: https://httpd.apache.org/docs/2.4/caching.html


Solution

  • In Apache, the ETag is made from the file's inode, size, and last-modified time: http://httpd.apache.org/docs/2.2/mod/core.html#FileETag

    There are several options, and you can make the choice configurable. Here is a list of possible options, from least to most reliable:

    1. [FASTEST OPTION] Check the file's last modification time at a resolution finer than 1 second. For example, on Windows, file time is measured in 100-nanosecond intervals. Also check the file size and inode, as Apache does. On Windows, instead of an inode, you can query the file ID of an open handle via GetFileInformationByHandle. See nFileIndexHigh and nFileIndexLow: these are the high and low parts, respectively, of the 64-bit file ID. If the file time, size, or file ID has changed, recalculate the hash.
    2. [SAFER OPTION] Besides the file time, size, and inode, also check the content of the file using the very fast CRC32 instruction that Intel introduced with SSE4.2 (it computes the CRC-32C, or Castagnoli, variant). It is much faster than any CRC32 implementation that existed before SSE4.2. If the file time, size, inode, or CRC32 has changed, recalculate the hash.
    3. [FAST AND SAFE OPTION, BUT CONSUMES HANDLES] Only calculate hashes while your server is running. When the server first starts, no hashes are calculated yet. When a file is first requested, calculate its hash and store it until the server exits. While the server is running, monitor changes to the files for which you hold hashes, using the operating system's file change notifications. For example, on Windows there is FindFirstChangeNotification, which watches a directory (and optionally its subtree) for changes.
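The change checks in options 1 and 2 can be sketched as a small cache keyed by path. This is only an illustration, not how Apache does it: `FileSignature`, `CachedETag`, and `ETagCache` are hypothetical names, and the code is kept portable, so filling the signature from GetFileInformationByHandle (file ID, size, last-write time) is left to the caller. The `crc32c` helper is a slow bitwise stand-in for the SSE4.2 instruction and produces the same CRC-32C value.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical snapshot of the metadata compared in options 1 and 2.
struct FileSignature {
    std::uint64_t fileId = 0;     // on Windows: (nFileIndexHigh << 32) | nFileIndexLow
    std::uint64_t size = 0;       // file size in bytes
    std::uint64_t lastWrite = 0;  // last-write time in 100-nanosecond units
    std::uint32_t crc32c = 0;     // option 2 only; leave at 0 for option 1

    bool operator==(const FileSignature& o) const {
        return fileId == o.fileId && size == o.size &&
               lastWrite == o.lastWrite && crc32c == o.crc32c;
    }
};

// Portable bitwise CRC-32C (Castagnoli polynomial, reflected form).
// The SSE4.2 crc32 instruction computes the same checksum far faster.
inline std::uint32_t crc32c(const void* data, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical cache entry: the signature seen when the ETag was computed.
struct CachedETag {
    FileSignature signature;
    std::string etag;
};

class ETagCache {
public:
    // Returns the cached ETag if the file's signature is unchanged,
    // or nullptr if the caller must recalculate the hash.
    const std::string* lookup(const std::string& path,
                              const FileSignature& current) const {
        auto it = entries_.find(path);
        if (it == entries_.end() || !(it->second.signature == current))
            return nullptr;
        return &it->second.etag;
    }

    // Records the ETag computed for the given signature.
    void store(const std::string& path, const FileSignature& sig,
               std::string etag) {
        entries_[path] = CachedETag{sig, std::move(etag)};
    }

private:
    std::unordered_map<std::string, CachedETag> entries_;
};
```

On each request the server would build a fresh `FileSignature`, call `lookup`, and only rehash the file and call `store` when `lookup` returns `nullptr`. With option 3, the change notification would instead evict the entry, and no per-request metadata query would be needed.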

    For the hash value of the ETag itself, I would recommend a cryptographic hash function, even one that is no longer considered strong enough for digital signatures. I would not recommend a hash function that was not explicitly designed to be cryptographically strong, because such functions need a larger digest than a cryptographic hash to reach a comparable level of resistance to collisions. By collision I mean two different files producing the same hash. MD5 is still very good for monitoring changes in file content, given its high speed and small digest size. Among the 128-bit hash functions originally designed for cryptography, it is the fastest available. You can also find fast MD5 implementations in assembly, for example from OpenSSL, from https://www.nayuki.io/page/fast-md5-hash-implementation-in-x86-assembly, or from https://github.com/maximmasiutin/MD5_Transform-x64. The performance of the last one is 4.94 CPU cycles per byte on processors with the Skylake microarchitecture.
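Once a digest is available, the conditional GET itself comes down to formatting the digest as an ETag and comparing it with the If-None-Match request header. The following is a minimal sketch of that step, not Apache's ap_condition_if_none_match: `makeETag`, `opaqueTag`, and `ifNoneMatchHits` are hypothetical helpers, the digest is assumed to come from whatever hash you chose (for example the 16 bytes of an MD5), and the comparison ignores the `W/` prefix on the assumption that If-None-Match uses weak comparison as described in RFC 7232.

```cpp
#include <cstddef>
#include <string>

// Formats raw digest bytes as a strong ETag: lowercase hex in double quotes.
inline std::string makeETag(const unsigned char* digest, std::size_t len) {
    static const char hex[] = "0123456789abcdef";
    std::string tag = "\"";
    for (std::size_t i = 0; i < len; ++i) {
        tag += hex[digest[i] >> 4];
        tag += hex[digest[i] & 0x0F];
    }
    tag += '"';
    return tag;
}

// Strips an optional weak-validator prefix (W/) so tags can be compared weakly.
inline std::string opaqueTag(const std::string& tag) {
    return tag.compare(0, 2, "W/") == 0 ? tag.substr(2) : tag;
}

// Returns true when the If-None-Match header matches the current ETag,
// i.e. when a GET may be answered with 304 Not Modified.
inline bool ifNoneMatchHits(const std::string& header,
                            const std::string& currentETag) {
    const std::string want = opaqueTag(currentETag);
    std::size_t pos = 0;
    while (pos < header.size()) {
        // Skip whitespace and the commas that separate list members.
        while (pos < header.size() &&
               (header[pos] == ' ' || header[pos] == '\t' || header[pos] == ','))
            ++pos;
        if (pos >= header.size()) break;
        if (header[pos] == '*') return true;  // "*" matches any existing file

        const std::size_t start = pos;
        if (header.compare(pos, 2, "W/") == 0) pos += 2;
        if (pos >= header.size() || header[pos] != '"') return false;  // malformed
        const std::size_t close = header.find('"', pos + 1);
        if (close == std::string::npos) return false;                  // malformed

        if (opaqueTag(header.substr(start, close - start + 1)) == want)
            return true;
        pos = close + 1;
    }
    return false;
}
```

A request handler would call `ifNoneMatchHits(requestHeaderValue, etag)` and, on a match, send 304 with the ETag header but no body; otherwise it would send the full 200 response with the current ETag.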