python-3.x yaml pyyaml

PyYAML loading YAML 1.1 with duplicate keys


I am trying to use PyYAML to load YAML 1.1 files. They are not marked as such, but they write octal integers as 0123 rather than 0o123.

I don't know how these files were generated, but one of the problems is that some of these files have duplicate keys, like:

xxx:
   aaa: 011
   bbb: 012
   ccc: 013
   aaa: 014

I am using yaml.safe_load() to load those files.

From reading the YAML documentation, section 10.2, I expected to get a warning and for aaa to have the value 9 (octal 011):

It is an error for two equal keys to appear in the same mapping node. In such a case the YAML processor may continue, ignoring the second key: value pair and issuing an appropriate warning.

But I get no warning and the value is 12.
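A minimal reproduction of what I see, with the sample above inlined as a string (the variable names are only for illustration):

```python
import yaml

# The sample document from above, with the duplicated key aaa.
doc = """\
xxx:
   aaa: 011
   bbb: 012
   ccc: 013
   aaa: 014
"""

data = yaml.safe_load(doc)
# The second aaa (octal 014 == 12) silently replaces the first (octal 011 == 9).
print(data)  # {'xxx': {'aaa': 12, 'bbb': 10, 'ccc': 11}}
```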

Is this a bug? Is there a way to get PyYAML to select the first value for the key?

I looked at a few libraries for other languages, hoping to clean this up before further processing, but those either threw an error or continued with the second value.

There are many files, often with the duplicates nested much deeper. They can have complex structures between the duplicated keys, and the same key names also occur legitimately in other mappings, which is valid. Using awk to fix this is not going to work for these files, and there are too many to fix by hand.


Solution

  • I would say that is a bug in PyYAML. The offending code is here:

    def construct_mapping(self, node, deep=False):
        if not isinstance(node, MappingNode):
            raise ConstructorError(None, None,
                    "expected a mapping node, but found %s" % node.id,
                    node.start_mark)
        mapping = {}
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if not isinstance(key, collections.Hashable):
                raise ConstructorError("while constructing a mapping", node.start_mark,
                        "found unhashable key", key_node.start_mark)
            value = self.construct_object(value_node, deep=deep)
            mapping[key] = value
        return mapping
    

    The code never checks whether the key is already present in the mapping, so a later value silently overwrites an earlier one. You would have to subclass the Constructor and give it a construct_mapping() that includes such a check:

            if key in mapping:
                warnings.warn(somewarning)
            else:
                mapping[key] = value
    

    And then create a Loader using that Constructor.
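
    As a rough sketch of that approach (not PyYAML's own code): the simplest route is probably to subclass yaml.SafeLoader, which already combines the safe Constructor with the other loader parts, and to override construct_mapping() there. The names FirstKeyWinsLoader, load_first_wins and SAMPLE are made up for this example, the warning text is arbitrary, and merge keys (<<) are deliberately left unhandled.

```python
import warnings
from collections.abc import Hashable

import yaml
from yaml.constructor import ConstructorError
from yaml.nodes import MappingNode


class FirstKeyWinsLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader variant: the first value of a duplicated key is kept."""

    def construct_mapping(self, node, deep=False):
        if not isinstance(node, MappingNode):
            raise ConstructorError(None, None,
                    "expected a mapping node, but found %s" % node.id,
                    node.start_mark)
        # Assumption: the input uses no merge keys ("<<"); they are not flattened here.
        mapping = {}
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if not isinstance(key, Hashable):
                raise ConstructorError("while constructing a mapping", node.start_mark,
                        "found unhashable key", key_node.start_mark)
            value = self.construct_object(value_node, deep=deep)
            if key in mapping:
                # Keep the first value and report the ignored duplicate.
                warnings.warn("duplicate key %r on line %d ignored"
                              % (key, key_node.start_mark.line + 1))
            else:
                mapping[key] = value
        return mapping


def load_first_wins(text):
    """Load YAML text, keeping the first value for any duplicated key."""
    return yaml.load(text, Loader=FirstKeyWinsLoader)


SAMPLE = """\
xxx:
   aaa: 011
   bbb: 012
   ccc: 013
   aaa: 014
"""

data = load_first_wins(SAMPLE)
print(data)  # {'xxx': {'aaa': 9, 'bbb': 10, 'ccc': 11}}
```

    Because the override sits on the loader class, it also applies to nested mappings, so duplicates deeper in the structure get the same treatment.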

    It might be simpler to use ruamel.yaml (disclaimer: I am the author of that package). It correctly loads this, assuming you disable the DuplicateKeyError, and explicitly set YAML 1.1 as the input format:

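    import sys
    from pathlib import Path

    import ruamel.yaml

    yaml_file = Path('xx.yaml')

    yaml = ruamel.yaml.YAML()
    # treat the input as YAML 1.1, so that 011 is read as an octal
    yaml.version = (1, 1)
    # indentation settings (only used when dumping with this instance)
    yaml.indent(mapping=3, sequence=2, offset=0)
    # do not raise DuplicateKeyError on repeated keys
    yaml.allow_duplicate_keys = True
    data = yaml.load(yaml_file)
    # the first value for the duplicated key is the one that is kept
    assert data['xxx']['aaa'] == 9
    # dump with a fresh instance using the default settings
    yaml_out = ruamel.yaml.YAML()
    yaml_out.dump(data, sys.stdout)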
    

    This gives:

    xxx:
      aaa: 9
      bbb: 10
      ccc: 11
    

    Your octals will be converted to decimals in the output (normally ruamel.yaml preserves that information, but not when loading legacy YAML 1.1). PyYAML will always do that.