
Concise way of defining your own YAML syntax?


For XML, there are Document Type Definitions (DTD) which define all elements, but is there something similar for YAML?

I found a post on Validating YAML with an XML DTD which suggests using DTDs anyway and/or a simple XML, but I doubt that is feasible in my case: my project decided to use a (custom) YAML format. From a YAML file in this format, a rather intricate XML is algorithmically generated. The YAML contains much less information than the XML, but everything significant that a human editor must know.

At the moment, the definition of my YAML exists mainly as prose (fairly abstract requirement text) and as the actual source code that does the parsing and conversion to XML. Neither is suitable for the end users who are supposed to maintain the YAML file. Is there a clean and concise way to define my custom YAML syntax?


Solution

  • Firstly, YAML is the syntax. What you want to describe is not syntax, but structure.

    YAML is a serialization format. Therefore, the type of the data you serialize from and deserialize into is the structure description of a YAML file. Unless you're using YAML for data interchange, you typically have one application that implements loading the YAML file.

    By default, a lot of YAML implementations deserialize to a heterogeneous structure of lists, dictionaries and simple values (string, int, …). However, if you assume a certain structure, you can write down types that define that structure and then load your YAML into an object of that type. Simple example (Java in this case):

    public class Book {
        public static class Person {
            public String name;
            public int age;
        }
    
        public Person author;
        public String title;
    }
    

    This type describes the structure of this YAML document:

    author:
      name: John Doe
      age: 23
    title: Very interesting title
    

    Any YAML implementation that is able to deserialize to types is able to inspect those types; either at runtime via reflection or at compile-time via macros or other means of compile-time evaluation. Therefore, you can inspect that structure as well and autogenerate documentation for the user with it (possibly employing JavaDoc comments for extended documentation).
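
    As a rough illustration of the reflection route, the following sketch walks the public fields of the Book type from above and prints an outline that could seed user documentation. The class name StructureDoc and the helpers describe and isScalar are made up for this example and are not part of any YAML library. It assumes Java 11+ (for String.repeat) and sorts fields by name, because reflection does not guarantee declaration order.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;

class StructureDoc {
    // The target types from the example above.
    public static class Book {
        public static class Person {
            public String name;
            public int age;
        }

        public Person author;
        public String title;
    }

    // Assumption for this sketch: primitives, their wrappers and strings
    // are the only types that map to YAML scalars.
    static boolean isScalar(Class<?> type) {
        return type.isPrimitive() || type == String.class
                || Number.class.isAssignableFrom(type)
                || type == Boolean.class || type == Character.class;
    }

    // Render the public instance fields of a type as an indented outline.
    // Note: this recurses forever on self-referential types.
    static String describe(Class<?> type, int indent) {
        Field[] fields = type.getFields();
        // getFields() makes no ordering promise, so sort for stable output.
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        StringBuilder out = new StringBuilder();
        for (Field field : fields) {
            if (Modifier.isStatic(field.getModifiers())) {
                continue;
            }
            out.append("  ".repeat(indent)).append(field.getName()).append(':');
            if (isScalar(field.getType())) {
                out.append(' ').append(field.getType().getSimpleName()).append('\n');
            } else {
                out.append('\n').append(describe(field.getType(), indent + 1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(Book.class, 0));
    }
}
```

    For Book, this prints author: followed by the indented lines age: int and name: String, and then title: String. A real generator would probably also pull in JavaDoc comments or annotations for the descriptive text.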

    Now you might use a dynamically typed language. If that language is Python, you can still define classes to describe your structure, and you can use type hints to declare the types of scalar values. This gives you user documentation; however, you still need to implement validation manually, since type hints are not enforced at runtime (PyYAML's add_path_resolver is the important hook here: it resolves parts of the document graph to specific types without requiring YAML tags).

    For other languages, different solutions may exist. Generally, it's a good idea to maintain a single source of truth (SSOT) that describes the YAML structure and then use it as the basis for both user documentation and validation. And since YAML is a serialization format, the target type is a natural choice for the SSOT if the language and YAML implementation allow you to define it.
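
    To make the SSOT idea concrete, here is a minimal sketch that uses the same kind of target type to validate a generically loaded document, i.e. the nested maps many YAML implementations return by default. The maps are built by hand so the example needs no YAML library. StructureCheck and check are hypothetical names, and only String fields, int fields and nested mappings are handled.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StructureCheck {
    // Target types; these are the single source of truth in this sketch.
    public static class Person {
        public String name;
        public int age;
    }

    public static class Book {
        public Person author;
        public String title;
    }

    // Compare a generically loaded mapping against the public instance
    // fields of the target type and collect human-readable problems.
    static List<String> check(Map<String, Object> data, Class<?> type, String path) {
        Map<String, Class<?>> expected = new LinkedHashMap<>();
        for (Field field : type.getFields()) {
            if (!Modifier.isStatic(field.getModifiers())) {
                expected.put(field.getName(), field.getType());
            }
        }
        List<String> problems = new ArrayList<>();
        for (String key : data.keySet()) {
            if (!expected.containsKey(key)) {
                problems.add(path + key + ": unknown key");
            }
        }
        for (Map.Entry<String, Class<?>> entry : expected.entrySet()) {
            String where = path + entry.getKey();
            Object value = data.get(entry.getKey());
            Class<?> want = entry.getValue();
            if (value == null) {
                problems.add(where + ": missing");
            } else if (want == String.class) {
                if (!(value instanceof String)) problems.add(where + ": expected a string");
            } else if (want == int.class) {
                if (!(value instanceof Integer)) problems.add(where + ": expected an integer");
            } else if (value instanceof Map) {
                // Assumes mapping keys are strings, as in the example document.
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) value;
                problems.addAll(check(nested, want, where + "."));
            } else {
                problems.add(where + ": expected a mapping");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Stand-in for a loaded YAML file with a misspelled key and a wrong type.
        Map<String, Object> author = Map.of("name", "John Doe", "age", "twenty-three");
        Map<String, Object> doc = Map.of("author", author, "titel", "Very interesting title");
        check(doc, Book.class, "").forEach(System.out::println);
    }
}
```

    Run on the hand-built document, this reports the unknown key titel, the missing title and the non-integer author.age. Because both the outline for users and these checks derive from the same classes, they cannot drift apart the way separately maintained prose and parsing code can.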