
ruamel.yaml loses anchor and skips aliases after the first alias on the same level when dumping


Given this YAML:

---
sp-database: &sp-database
  DATABASE_NAME: blah
  DATABASE_PORT: 5432
  DATABASE_SCHEMA: public
  DATABASE_USERNAME: foo
  DATABASE_DRIVER: bar
  DATABASE_TYPE: pg

rabbit: &rabbit
  RABBIT_PORT: 5672
  RABBIT_USERNAME: foo

sp-env: &sp-env
  <<: *sp-database
  <<: *rabbit
  REDIS_PORT: 6379

when I read this code in and dump it out:

def blah(self):
    # self.yaml is a ruamel.yaml.YAML() instance created elsewhere
    values_file = './src/values.yaml'
    with open(values_file, 'r') as stream:
        data = self.yaml.load(stream)
    values_file = './src/values1.yaml'
    with open(values_file, 'w') as file:
        self.yaml.indent(sequence=4, offset=2)
        self.yaml.dump(data, file)

The closest solution I found was this: "How to generate multiple YAML anchors and references in Python?"

in which I did change the alias usage to this:

sp-env: &sp-env
  <<: [ *sp-database, *rabbit ]
  REDIS_PORT: 6379

and it works, but I want to figure out why it doesn't work with the sequential aliases, as opposed to the array-subscripted aliases.


Solution

  • The [YAML specification describes mappings](https://yaml.org/spec/1.2.2/) quite clearly:

    The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique.

    Although ruamel.yaml has always happily ignored the "unordered" part in there, in order to keep the keys in the output in the same order, it doesn't ignore the fact that your input contains non-unique keys. Since what you call the "sequential aliases" use the key "<<" twice in the same mapping, loading your first example throws an error:

    import sys
    from pathlib import Path

    import ruamel.yaml

    file_in = Path('values.yaml')
    yaml = ruamel.yaml.YAML()
    yaml.indent(mapping=4, sequence=4, offset=2)
    try:
        data = yaml.load(file_in)
    except Exception as e:
        print('exception:', e)
    else:
        yaml.dump(data, sys.stdout)
    

    which gives:

    exception: while constructing a mapping
      in "values.yaml", line 14, column 9
    found duplicate key "<<"
      in "values.yaml", line 16, column 3
    
    To suppress this check see:
       http://yaml.readthedocs.io/en/latest/api.html#duplicate-keys
    
    Duplicate keys will become an error in future releases, and are errors
    by default when using the new API.
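
    If you just want such input to load anyway, ruamel.yaml lets you suppress this check through the allow_duplicate_keys attribute mentioned in that link. A minimal sketch (how the duplicate "<<" keys are then combined is ruamel.yaml's implementation choice, not something the spec guarantees):

    import sys
    import ruamel.yaml

    yaml = ruamel.yaml.YAML()
    yaml.allow_duplicate_keys = True  # accept the duplicate "<<" keys
    with open('values.yaml') as stream:
        data = yaml.load(stream)
    yaml.dump(data, sys.stdout)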
    

    As it is not allowed to have duplicate keys, it is unclear what to expect if you load such a document with a library that doesn't follow the standard. Assume that a non-merge key appears twice and such a faulty library processes the key/value pairs in the order in which the keys appear in the YAML file. If it puts each key/value pair in a Python dict (without checking whether the key already exists), the value from the second occurrence overwrites the one from the first.

    YAML:

    a: 1
    b: 2
    a: 3
    

    would give the Python dict {'a': 3, 'b': 2}. It should be clear that if such a library instead used a stack to first gather all the key/value pairs and then processed them by popping, the value for a would be different, while both approaches would still give the same result after loading valid YAML.
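
    A small sketch of those two hypothetical processing strategies (illustrative code, not ruamel.yaml's internals):

    pairs = [('a', 1), ('b', 2), ('a', 3)]

    # strategy 1: process the pairs in document order; later keys overwrite
    in_order = {}
    for key, value in pairs:
        in_order[key] = value
    print(in_order)  # {'a': 3, 'b': 2}

    # strategy 2: push all pairs on a stack, then pop; earlier keys win
    stack = list(pairs)
    popped = {}
    while stack:
        key, value = stack.pop()
        popped[key] = value
    print(popped)  # {'a': 1, 'b': 2}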

    Processing the merge key makes this more complex. The merge key specification states:

    If the value associated with the key is a single mapping node, each of its key/value pairs is inserted into the current mapping, unless the key already exists in it. If the value associated with the merge key is a sequence, then this sequence is expected to contain mapping nodes and each of these nodes is merged in turn according to its order in the sequence. Keys in mapping nodes earlier in the sequence override keys specified in later mapping nodes.

    The semantics for the merge key can be implemented by first populating the Python dict with the key/value pairs from the merge key, so that you don't have to check for "unless the key already exists", and then potentially overwriting them with those of the keys that occur in the mapping itself. If you don't want to check for a merge key up front, you can process the "normal" keys first and, when you encounter a merge key, only add those key/value pairs from the merge for which the key is not yet in the dict.
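
    A minimal sketch of that first strategy (a hypothetical helper, not ruamel.yaml's actual code), handling both a single merge mapping and a sequence of them:

    def merged_mapping(own_pairs, merge_value):
        # normalize: the merge value is a single mapping or a list of mappings
        merge_maps = merge_value if isinstance(merge_value, list) else [merge_value]
        result = {}
        # mappings earlier in the sequence override later ones, so
        # setdefault() keeps the first value seen for each key
        for merge_map in merge_maps:
            for key, value in merge_map.items():
                result.setdefault(key, value)
        # keys occurring in the mapping itself override anything merged in
        result.update(own_pairs)
        return result

    sp_database = {'DATABASE_PORT': 5432}
    rabbit = {'RABBIT_PORT': 5672}
    print(merged_mapping({'REDIS_PORT': 6379}, [sp_database, rabbit]))
    # {'DATABASE_PORT': 5432, 'RABBIT_PORT': 5672, 'REDIS_PORT': 6379}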

    In both cases this leads to a situation in which you have a dict in the making that might have some keys that can be overwritten by keys you process for that mapping later on (i.e. those coming from a merge) and some that should throw an error (i.e. duplicate keys). Failing to check for the duplication might give you faulty data (accepting duplicates where it shouldn't).

    When the value for a merge key consists of a sequence of mapping nodes (which often are, but don't have to be, aliases), the processing order matters and is explicitly specified: values for keys occurring in mappings earlier in the sequence are taken. In your second YAML example that means you'll get the value 5432 for DATABASE_PORT; in your first example it is not clear what that value would be, as there should be no ordering associated with the keys. In that case (assuming no exception is thrown) *rabbit could be processed before *sp-database or after it, and the way it is processed (as described above) would also influence the outcome.
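
    You can check that ordering with a small self-contained example (a trimmed-down variant of your second snippet, with a deliberately conflicting key added to *rabbit):

    import ruamel.yaml

    yaml_str = """\
    sp-database: &sp-database
      DATABASE_PORT: 5432

    rabbit: &rabbit
      RABBIT_PORT: 5672
      DATABASE_PORT: 9999

    sp-env:
      <<: [ *sp-database, *rabbit ]
      REDIS_PORT: 6379
    """

    yaml = ruamel.yaml.YAML()
    data = yaml.load(yaml_str)
    print(data['sp-env']['DATABASE_PORT'])  # 5432: *sp-database comes first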

    By requiring unique keys, the spec prevents all of this processing-order mess. But if a library ignores that requirement it will still produce results, only those results are implementation dependent. The problem lies in the fact that these results are probably consistent, so some people (who don't read the specs) start to rely on them. Then you're processing something that is almost, but not quite, YAML, and you might have to keep supporting that behaviour, instead of following the spec, at least until the next major version number change.