Hash algorithm for certificate / CRL directory

OpenSSL is able to use a specific directory structure for CA certificates and CRLs. If you pass a directory name as the third argument to SSL_CTX_load_verify_locations (as described in this question), it will look for CA certificates in this directory in order to verify client certificates. It finds the correct CA certificate by hashing the issuer name of the client certificate and appending a dot and an integer, e.g. 34bb8598.0. Usually, those names are symlinks pointing to the real files, and the symlinks are created using the c_rehash tool.

Likewise, OpenSSL can store certificate revocation lists in such directories, as described in this question, and look up the correct revocation list by the hash of the certificate issuer.

Now, I need to make a program reuse such a CRL directory. The program doesn't use OpenSSL, so I need to generate those hashes in some other way. What is the algorithm for generating those hashed filenames?

Solution

The hash format is not documented, so it may change without notice; in fact, it has changed once already. The x509 command supports the options -subject_hash and -issuer_hash for the current format, along with -subject_hash_old and -issuer_hash_old for the format used before OpenSSL 1.0.0. This description covers the "new" hash format as of OpenSSL 1.0.1f.

The X509_subject_name_hash and X509_issuer_name_hash functions just call X509_NAME_hash on the corresponding certificate field. That function takes the SHA-1 hash of the "canonical encoding" of the name, treats the first four bytes of the digest as a little-endian 32-bit integer, and returns it. The filename is that integer printed as eight hexadecimal digits, so the first four bytes of the digest effectively appear in reverse order.
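To make the hashing step concrete, here is a minimal Python sketch. It is not OpenSSL's code: the function names are invented for this example, and `canonical` is assumed to already contain the canonical encoding of the name. The `rN` suffix for CRL entries follows the naming that c_rehash uses for CRL links.

```python
import hashlib
import struct


def name_hash(canonical: bytes) -> str:
    """Return the 8-hex-digit directory hash for a canonical name encoding."""
    digest = hashlib.sha1(canonical).digest()
    # Read the first four digest bytes as a little-endian unsigned 32-bit integer.
    (value,) = struct.unpack("<I", digest[:4])
    return f"{value:08x}"


def hashed_filename(canonical: bytes, index: int = 0, crl: bool = False) -> str:
    """Build an entry name such as 34bb8598.0, or 34bb8598.r0 for a CRL."""
    suffix = f"r{index}" if crl else str(index)
    return f"{name_hash(canonical)}.{suffix}"
```

The index starts at 0 and is only incremented when two different names happen to produce the same hash.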

So what is the "canonical encoding"? It is a mutation of the DER representation of the issuer name, generated by the function x509_name_canon. DER is a tag-length-value encoding. The object tree we're representing looks like this:

RDNSequence, a SEQUENCE with tag 0x30 (decimal 48)
- One or more RelativeDistinguishedName items, each a SET with tag 0x31 (decimal 49)
  - One or more AttributeTypeAndValue items, each a SEQUENCE with tag 0x30
    - A type, represented as an OID, with tag 0x06
    - A string value, and this is where it gets interesting
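If you have to pull the issuer name apart without an ASN.1 library, a small tag-length-value reader is enough. The following Python sketch is only an illustration: the helper names are made up, it handles single-byte tags only, and it assumes well-formed DER in which every attribute consists of exactly an OID followed by one string value.

```python
def read_tlv(data: bytes, offset: int = 0):
    """Return (tag, value, next_offset) for the DER item starting at offset."""
    tag = data[offset]
    length = data[offset + 1]
    offset += 2
    if length & 0x80:
        # Long form: the low 7 bits count the big-endian length bytes that follow.
        count = length & 0x7F
        length = int.from_bytes(data[offset:offset + count], "big")
        offset += count
    return tag, data[offset:offset + length], offset + length


def children(value: bytes):
    """Yield (tag, value) for each item packed inside a constructed value."""
    offset = 0
    while offset < len(value):
        tag, inner, offset = read_tlv(value, offset)
        yield tag, inner


def walk_name(der: bytes):
    """Return one list per RDN, holding (oid_bytes, string_tag, string_bytes)."""
    _, rdn_sequence, _ = read_tlv(der)        # outer SEQUENCE (0x30)
    result = []
    for _, rdn in children(rdn_sequence):     # each RDN is a SET (0x31)
        attrs = []
        for _, atv in children(rdn):          # each attribute is a SEQUENCE (0x30)
            (_, oid), (string_tag, string) = children(atv)
            attrs.append((oid, string_tag, string))
        result.append(attrs)
    return result
```

For a name consisting only of CN=A encoded as a printable string, walk_name returns one RDN containing the OID bytes 55 04 03, the tag 0x13 and the value b"A".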

The string values, as given in the certificate, can be represented with a number of different types, e.g. a "printable string" with tag 0x13, an "IA5 string" with tag 0x16, or a UTF-8 string with tag 0x0c.

When generating the "canonical encoding", the string value of each attribute in the RDNSequence is converted to UTF-8 and re-encoded as a UTF-8 string with the tag 0x0c. This happens in the asn1_string_canon function. Furthermore, the following transformations are applied:

- Any leading and trailing whitespace is removed. Any byte with its high-order bit set is let through without change, so "whitespace" in this context means space, form-feed, newline, carriage return, horizontal tab and vertical tab.
- Inside the string, any run of one or more whitespace characters as defined above gets replaced with a single space (0x20).
- Characters are converted to lowercase. Since any byte with its high-order bit set is ignored, this only applies to the ASCII letters A to Z.
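The string transformations above can be sketched in Python like this. Treat it as an approximation rather than a port of asn1_string_canon: the function name is invented, and the tag-to-codec table is my assumption about how the source string types map to text (one byte per character for the single-byte types, big-endian UCS-2 and UCS-4 for BMPString and UniversalString). String types not listed in the table are not handled.

```python
ASCII_WS = b" \t\n\x0b\x0c\r"  # space, tab, newline, vertical tab, form feed, CR

# Assumed mapping from ASN.1 string tag to a Python codec.
CODECS = {
    0x0C: "utf-8",      # UTF8String
    0x13: "latin-1",    # PrintableString
    0x14: "latin-1",    # T61String (assumption: treated as one byte per character)
    0x16: "latin-1",    # IA5String
    0x1C: "utf-32-be",  # UniversalString
    0x1E: "utf-16-be",  # BMPString
}


def canon_string(tag: int, raw: bytes) -> bytes:
    """Return the canonical UTF-8 content bytes for one attribute value."""
    data = raw.decode(CODECS[tag]).encode("utf-8")
    out = bytearray()
    pending_space = False
    # Work on bytes, so multi-byte UTF-8 sequences pass through untouched.
    for byte in data.strip(ASCII_WS):
        if byte in ASCII_WS:
            pending_space = True   # collapse the whole run into one space
            continue
        if pending_space:
            out.append(0x20)
            pending_space = False
        if 0x41 <= byte <= 0x5A:   # ASCII A-Z only
            byte += 0x20
        out.append(byte)
    return bytes(out)
```

Note that a non-ASCII letter such as "Ü" keeps its case, and a non-breaking space (U+00A0) is neither trimmed nor collapsed, because both are encoded with high-bit bytes.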

And that's all you need to do.

Note that the ASN.1 definitions of some of the fields in question do not permit UTF-8 strings (for example, country codes are restricted to "printable strings"), so you may not be able to use your ASN.1 encoding library directly.
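Because of that restriction, it may be simplest to write the few DER bytes by hand. Below is a rough end-to-end sketch in Python under several assumptions that you should check against the output of openssl x509 -issuer_hash (or openssl crl -hash) before relying on it: all function names are invented, the input values are assumed to be already decoded to text, and, from my reading of x509_name_canon, the hashed bytes are assumed to be the concatenated RDN SETs without the outer SEQUENCE header.

```python
import hashlib
import re


def tlv(tag: int, value: bytes) -> bytes:
    """Encode one DER tag-length-value item (single-byte tags only)."""
    n = len(value)
    if n < 0x80:
        length = bytes([n])
    else:
        body = n.to_bytes((n.bit_length() + 7) // 8, "big")
        length = bytes([0x80 | len(body)]) + body
    return bytes([tag]) + length + value


def canon(text: str) -> bytes:
    """Collapse ASCII whitespace runs, trim, and lowercase A-Z in the UTF-8 bytes."""
    raw = re.sub(b"[ \t\n\x0b\x0c\r]+", b" ", text.encode("utf-8")).strip(b" ")
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in raw)


def canonical_name(rdns) -> bytes:
    """rdns: list of RDNs, each a list of (oid_content_bytes, text) pairs."""
    out = b""
    for rdn in rdns:
        atvs = [tlv(0x30, tlv(0x06, oid) + tlv(0x0C, canon(text)))
                for oid, text in rdn]
        # DER requires the members of a SET OF to be sorted by their encoding.
        out += tlv(0x31, b"".join(sorted(atvs)))
    # Assumption: no outer SEQUENCE (0x30) header is added before hashing.
    return out


def directory_hash(rdns) -> str:
    """SHA-1 the canonical name and format the first 4 bytes, little-endian."""
    digest = hashlib.sha1(canonical_name(rdns)).digest()
    return f"{int.from_bytes(digest[:4], 'little'):08x}"


# Example input: C=US, O=Example CA (OID content bytes for 2.5.4.6 and 2.5.4.10).
issuer = [[(bytes.fromhex("550406"), "US")],
          [(bytes.fromhex("55040a"), "Example  CA")]]
print(directory_hash(issuer) + ".r0")
```

The printed name is what the CRL for that issuer would be looked up as, provided the assumptions above hold for your OpenSSL version.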