Search code examples
algorithmparsinggrammarcontext-free-grammar

Parsing X.509 ASN.1 binary data


According to this article, we can see the following in the paper:

enter image description here

  1. Can we treat the above X.509 grammar as this: the SEQUENCE and SET elements are non-terminals, while the rest are terminals. We could therefore also rewrite the grammar as follows:

     Certificate = A
     signatureAlgorithm = c
     signature = d
     ToBeSigned = T
    
     S => A | AA    ; either a single or many Certificate elements
     A => TA | Tcd  ; TA supports TA, TTA, TTTA, so we can have multiple T non-terminals (cd are non-terminals: signartureAlgorithm and signature which come at the end)
     T = vs...e     ; all of the elements from version, serialNumber, ..., extensions
    

What would be the best way to transform the above into a more usable grammar with standard letters: can we treat all ToBeSigned elements as simply as non-terminals: vertor = v, serialNumber = s, etc?

  1. According to the Chomsky Hierarchy, which kind of grammar is this? Is it a regular grammar?

Solution

  • Can we treat the above X.509 grammar as this: the SEQUENCE and SET elements are non-terminals, while the rest are terminals?

    I'd say that's basically a category error. But even if it were not, it's not a particularly useful model for what ASN.1 types are.

    A terminal is an atomic unit with no grammatical structure. (At least, that's one intuition for a terminal. In Chomskyian context-free grammars, the definition is even simpler: it's a symbol which does not appear on the left-hand side of any production.) But many of the types in that extract do have definitions in the X.509 documents, so they certainly aren't terminals. Arguably, BIT STRING is a terminal, since it's more or less a primitive ASN.1 type, but in concrete grammars in the BER and DER profiles, it's actually a non-terminal, too.

    "ASN" is an abbreviation for "Abstract Syntax Notation"; avoiding the ambiguity of English noun phrases, it is a notation for writing abstract syntaxes. ASN.1 abstracts away from the concrete syntax. It's a way of focussing on the structural interrelationships inside a complex data object without getting bogged down in the real-world mechanics of parsing or generating sentences. In that sense, it's more or less the mirror image of a Chomskyian grammar; Chomsky's goal (in the famous paper which gave rise to the hierarchy) was to abstract away from the semantics, leaving only the mechanics of parsing and generating sentences. It's in this sense that I think the question here is a category error; it's attempting to unify two conceptual systems whose essence is to separate the concepts.

    As a final note, I think you may be misunderstanding the ASN.1 concept of SEQUENCE, which is different from a SEQUENCE OF. A SEQUENCE is, roughly speaking, an abstraction of something like a C struct; it's a single composite type with named members. It's not a repetition. if you want to express the concept of an ordered list of objects of the same type, you use a SEQUENCE OF, which is vaguely similar to a C array, although the length constraints can be a lot more flexible.

    SET and SET OF are the same, except that they are unordered. I don't know of any language which implements anything resembling an ASN.1 SET, and I'm not sure what advantage such a thing would have. But SET OF types certainly exist in many languages, where they're usually called things like "unordered multiset". (Again, ASN.1 allows a the specification of complex constraints on particular SET OF objects. A common use case is representing a mapping type as a SET OF two-element SEQUENCEs.

    ASN.1 also has a CHOICE composite type, which (like a C++ std::variant) represents a single object of one of a specified list of types.

    One of the interesting features of ASN.1 is that it allows recursive definitions (although it disallows infinite concrete objects, so the recursive element needs to be optional or inside a possibly empty SEQUENCE OF/SET OF composite. Recursive definitions at least open the possibility of non-regular grammars.

    ASN.1 tools build serializers (generators) and deserializers (parsers) from a set of ASN.1 descriptions. There are a large variety of representation schemes which can be used, and most of them require a more complex syntax than might be apparent from the Abstract Syntax defined in ASN.1. For example, in BER sequence-of objects and character strings can be transparently represented either by preceding the elements with a precise length, or as a series of subsequence chunks followed by an "end-of-sequence" sentinel. Even integers are preceded by a length indicator, since an unconstrained ASN.1 integer does not have a size limit. (The ASN.1 definition can apply a constraint, and some representation schemes use this constraint to simplify the concrete representation.)

    X.509 certificates are serialized using the DER representation, which was designed to eliminate encoding options. The intent of DER is to create a one-to-one mapping between each representable value and a bit-vector which can be transmitted. Some aspects of DER are very difficult (if not impossible) to write in a context-free grammar, starting with the requirement that variable-length bit vectors be preceded by their length. Even more complicated is the DER requirement for transmitting SET-OF composites: the elements in the SET-OF must be individual encoded, and then the encodings are output in lexicographic bit ordering.

    The point of a canonical representation such as DER, by the way, is precisely to make it possibly to digitally sign a serialised datum.

    In my opinion, ASN.1 is a fascinating and lamentably undervalued body of work. Although it seems unlikely that it will every enjoy a revival (but, then, who knows?), much less that there will ever be an ASN.2, the astonishing range of issues considered in the design of ASN.1 elucidates the complex task of reliably transmitting serialized data in a way that could have immensely benefited the design of other "simpler" serialization technologies.