Reading and writing back yaml files with multi-line strings


I have to read a YAML file, modify it, and write it back using PyYAML. Everything works fine except when there are multi-line string values in single quotes, e.g. if the input YAML file looks like

FOO:
  - Bar: '{"HELLO":
"WORLD"}'

then reading it with data = yaml.load(open("foo.yaml")) and writing it with yaml.dump(data, fref, default_flow_style=False) generates something like

FOO:
- Bar: '{"HELLO": "WORLD"}'

i.e. without the line break in the Bar value. The strange thing is that if the input file has something like

FOO:
- Bar: '{"HELLO":

    "WORLD"}'

i.e. one extra empty line in the Bar value, then writing it back generates the correct number of line breaks. Any idea what I am doing wrong?


Solution

  • You are not doing anything wrong, but you probably should have read more of the YAML specification.

    According to the (outdated) 1.1 spec that PyYAML implements, within single quoted scalars:

    In a multi-line single-quoted scalar, line breaks are subject to (flow) line folding, and any trailing white space is excluded from the content.

    And line-folding:

    Line folding allows long lines to be broken for readability, while retaining the original semantics of a single long line. When folding is done, any line break ending an empty line is preserved. In addition, any specific line breaks are also preserved, even when ending a non-empty line.

    This means that your first two examples (the original input and the dumped output) are the same, because the single line break is read as if it were a space.

    The third example is different: it actually contains a newline after loading, because "any line break ending an empty line is preserved". To understand why that dumps back as it was loaded, you have to know that PyYAML doesn't keep any information about the quoting (nor about the single line break in the first example); it just loads the scalar into a Python string. During dumping, PyYAML evaluates how that string can best be written. Unless you try to force things using the default_style argument to dump(), the options it considers are plain style, single quoted style, and double quoted style.
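Both claims can be checked by loading the snippets and looking at the resulting Python strings. This is a minimal sketch, assuming PyYAML is installed; the variable names are illustrative, and the continuation lines are indented here, which is an assumption about the layout of the original file.

```python
import yaml

# Single line break inside the single quoted scalar (continuation line
# indented here; the variable names are illustrative only).
one_break = """\
FOO:
  - Bar: '{"HELLO":
    "WORLD"}'
"""

# Same scalar, but with an empty line inside the quotes.
empty_line = """\
FOO:
  - Bar: '{"HELLO":

    "WORLD"}'
"""

folded = yaml.safe_load(one_break)["FOO"][0]["Bar"]
kept = yaml.safe_load(empty_line)["FOO"][0]["Bar"]

print(repr(folded))  # the line break was folded into a single space
print(repr(kept))    # the empty line became a real newline
print(type(folded))  # plain str, with no record of the original quoting
```

After loading, nothing distinguishes the folded value from one that was written on a single line, which is why the dumper cannot restore the original wrapping.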

    PyYAML will use plain style (without quotes) when possible, but since the string starts with {, that would collide with that character's use as the start of a flow style mapping, so quoting is necessary. Since the string also contains double quotes, and no characters that need backslash escaping, the "cleanest" representation PyYAML can choose is single quoted style, and in that style it has to represent a line break by including an empty line within the single quoted scalar.
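The style choice can be observed by dumping the two loaded values directly. This is a small sketch, assuming PyYAML with its default dumper settings; the dictionary names are made up for the illustration.

```python
import yaml

# The two Python values PyYAML ends up with after loading the question's files.
without_newline = {"FOO": [{"Bar": '{"HELLO": "WORLD"}'}]}
with_newline = {"FOO": [{"Bar": '{"HELLO":\n"WORLD"}'}]}

dumped_flat = yaml.dump(without_newline, default_flow_style=False)
dumped_multi = yaml.dump(with_newline, default_flow_style=False)

print(dumped_flat)   # single quoted, on one line
print(dumped_multi)  # single quoted, with an empty line standing in for the newline

# Loading the dumped text gives back the same Python data.
assert yaml.safe_load(dumped_multi) == with_newline
```

The empty line in the second output is not decoration: it is how a real newline has to be written in single quoted style.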

    I would personally prefer using a block style literal scalar to represent your last example:

    FOO:
    - Bar: |
        {"HELLO":
          "WORLD"}
    

    but if you load and then dump that using PyYAML, its readability is lost.
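The loss of the block style can be seen with a short round trip. This is a sketch assuming PyYAML's default dumper; the snippet loads a literal block scalar like the one above, with the content indented deeper than the Bar key so that PyYAML accepts it.

```python
import yaml

literal_block = """\
FOO:
- Bar: |
    {"HELLO":
      "WORLD"}
"""

data = yaml.safe_load(literal_block)
value = data["FOO"][0]["Bar"]
print(repr(value))  # the inner indentation and the final newline are content

redumped = yaml.dump(data, default_flow_style=False)
print(redumped)     # no "|" indicator any more; the value is quoted instead
```

Because this value has a line break followed by spaces, PyYAML falls back to double quoted style with escaped newlines and quotes, which is valid YAML but much harder to read than the literal block.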

    Although worded differently in the YAML 1.2 specification (released almost 10 years ago), line folding works the same, so this would "work" in a similar way with a more up-to-date YAML loader/dumper. My package ruamel.yaml, for loading/dumping YAML 1.2, will properly maintain the block style if you set the attribute preserve_quotes = True on the YAML() instance, but it will still get rid of the line break in your first example. Preserving it could be implemented (as shown by ruamel.yaml preserving appropriate newline positions in folded style block scalars), but nobody ever asked for that, probably because people who want that kind of control over wrapping use a block style to start with.