In YAML, must a quoted scalar be interpreted by a parser as a string?


I've seen advice around the Internet that if you want a YAML scalar value to be processed as a string, you should quote it:

foo : "2018-04-17"

In the example above, this advice is intended to tell me that the value 2018-04-17 will be processed by any given YAML parser as its native language's string type. For example, SnakeYAML would, if this advice were true, interpret this as a java.lang.String, and not as a java.util.Date. (As it happens, SnakeYAML interprets this as a java.util.Date, quotes or not, which is why I'm asking this question.)

But although this advice may happen to work with a given parser, I can't see where in the YAML 1.2 specification it might come from. The closest thing I can find is the following sentence:

YAML allows scalars to be presented in several formats. For example, the integer “11” might also be written as “0xB”. Tags must specify a mechanism for converting the formatted content to a canonical form for use in equality testing. Like node style, the format is a presentation detail and is not reflected in the serialization tree and representation graph.

And this one:

The scalar style is a presentation detail and must not be used to convey content information, with the exception that plain scalars are distinguished for the purpose of tag resolution.

And this one:

Note that resolution must not consider presentation details such as comments, indentation and node style.

Nevertheless, I see lots of YAML documents that rely on the double-quoting-the-value-means-it-will-be-parsed-as-a-string advice, which makes me think I'm misreading something. Is there contention on this subject?


Solution

  • Relevant section from the YAML 1.1 spec (note that SnakeYAML implements YAML 1.1, so the 1.2 spec does not necessarily apply):

    It is not required that all the tags of the complete representation be explicitly specified in the character stream. During parsing, nodes that omit the tag are given a non-specific tag: “?” for plain scalars and “!” for all other nodes. [...]

    It is recommended that nodes having the “!” non-specific tag should be resolved as “tag:yaml.org,2002:seq”, “tag:yaml.org,2002:map” or “tag:yaml.org,2002:str” depending on the node’s kind. This convention allows the author of a YAML character stream to exert some measure of control over the tag resolution process. By explicitly specifying a plain scalar has the “!” non-specific tag, the node is resolved as a string, as if it was quoted or written in a block style. Note, however, that each application may override this behavior. For example, an application may automatically detect the type of programming language used in source code presented as a non-plain scalar and resolve it accordingly.

    So to sum up, a YAML processor is not required to parse quoted scalars as strings, and YAML does not dictate which native type tag:yaml.org,2002:str maps to. In fact, most YAML implementations follow only parts of that recommendation. For example, when you deserialise YAML into a POJO/JavaBean with SnakeYAML, you typically use no explicit tags in your YAML, yet your mappings are resolved to the corresponding Java classes in the root class's structure rather than to the generic Map that the recommendation suggests (under YAML 1.1, every mapping without an explicit tag gets the ! non-specific tag).

    Note that this has been changed in YAML 1.2:

    During parsing, nodes lacking an explicit tag are given a non-specific tag: “!” for non-plain scalars, and “?” for all other nodes.
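To make the difference between the two quoted rules concrete, here is a minimal sketch in plain Java. It is not SnakeYAML code and uses no YAML library; the class `NonSpecificTagSketch`, the enum `Kind`, and the methods `yaml11`, `yaml12` and `recommendedTag` are hypothetical names invented for this illustration. It only encodes the two sentences quoted from the specs, plus the YAML 1.1 recommendation for resolving the "!" non-specific tag.

```java
final class NonSpecificTagSketch {
    enum Kind { SCALAR, SEQUENCE, MAPPING }

    /** YAML 1.1 wording: "?" for plain scalars, "!" for all other nodes. */
    static String yaml11(Kind kind, boolean plain) {
        // The plain flag only matters for scalars.
        return (kind == Kind.SCALAR && plain) ? "?" : "!";
    }

    /** YAML 1.2 wording: "!" for non-plain scalars, "?" for all other nodes. */
    static String yaml12(Kind kind, boolean plain) {
        return (kind == Kind.SCALAR && !plain) ? "!" : "?";
    }

    /**
     * Recommended resolution of the "!" non-specific tag by node kind.
     * Returns null for "?", which this sketch leaves to the application.
     */
    static String recommendedTag(String nonSpecificTag, Kind kind) {
        if (!nonSpecificTag.equals("!")) {
            return null;
        }
        switch (kind) {
            case SEQUENCE: return "tag:yaml.org,2002:seq";
            case MAPPING:  return "tag:yaml.org,2002:map";
            default:       return "tag:yaml.org,2002:str";
        }
    }

    public static void main(String[] args) {
        // An untagged mapping is where the two versions disagree.
        System.out.println(yaml11(Kind.MAPPING, false)); // !
        System.out.println(yaml12(Kind.MAPPING, false)); // ?
        // A quoted scalar such as "2018-04-17" gets "!" under both wordings.
        System.out.println(yaml11(Kind.SCALAR, false));  // !
        System.out.println(yaml12(Kind.SCALAR, false));  // !
    }
}
```

Under both wordings a quoted scalar receives "!", and the recommendation then points to a string. The specs word this as a convention that applications may override, which is why the sketch cannot predict what any particular parser does.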

    That is closer to what most implementations do. Still, if you deserialise into a class such as class Foo { String bar; }, the following still loads, even though the quoted key "bar" ends up as a field name rather than a string:

    "bar": some value
    

    So the advice for using YAML is to specify the desired structure on the application side. In SnakeYAML, you would set the root class type, and every value is then mapped to the type required at its point in the hierarchy, as long as it can be mapped there, regardless of whether it is quoted or unquoted. In general, it makes more sense for the application to specify which kind of value it expects throughout the hierarchy than for the YAML author to do that via quoting. This also conforms to the YAML spec, which says:

    Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node, and (3) the content (and hence the kind) of the node.

    Resolving a tag is the YAML term for determining the target type, and the spec allows the target type to be determined by the node's position in the hierarchy. The root type follows from the fact that the element is the root of the YAML document; in the case of SnakeYAML, it may be fed in via the API. All other types follow from the fact that they are descendants of the root type.
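As a rough illustration of resolving by position, the following plain-Java sketch picks the target type from the path alone. It does not use SnakeYAML and does not show how SnakeYAML is implemented; the class `PathResolutionSketch`, the map `EXPECTED_TYPES`, the method `resolve` and the paths `/foo` and `/released` are all assumptions made up for this example. It uses `java.time.LocalDate` rather than `java.util.Date` purely to keep the parsing short, and it ignores the non-specific tag entirely, which is one possible choice among the three permitted parameters.

```java
import java.time.LocalDate;
import java.util.Map;

final class PathResolutionSketch {
    // Hypothetical application-side schema: path from the root -> expected Java type.
    static final Map<String, Class<?>> EXPECTED_TYPES = Map.of(
            "/foo", String.class,
            "/released", LocalDate.class);

    /**
     * Converts scalar content according to the type expected at its path.
     * Whether the scalar was quoted is deliberately not a parameter.
     */
    static Object resolve(String path, String content) {
        Class<?> target = EXPECTED_TYPES.getOrDefault(path, String.class);
        if (target == LocalDate.class) {
            return LocalDate.parse(content); // expects an ISO-8601 date
        }
        return content; // unknown paths fall back to a plain string
    }

    public static void main(String[] args) {
        Object foo = resolve("/foo", "2018-04-17");
        Object released = resolve("/released", "2018-04-17");
        System.out.println(foo.getClass().getSimpleName());      // String
        System.out.println(released.getClass().getSimpleName()); // LocalDate
    }
}
```

The same content, 2018-04-17, becomes a String at one path and a date at another, with quoting playing no role. That is the sense in which the application, not the YAML author, decides the type.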

    Final note: If you really really want something to be a string, !!str 2018-04-17 will do since it sets a specific tag for the node.