I am trying to use PyYAML to load YAML 1.1 files (they are not tagged as such but they have the octal integer values 0123 instead of 0o123).
I don't know how these files were generated, but one of the problems is that some of these files have duplicate keys, like:
    xxx:
      aaa: 011
      bbb: 012
      ccc: 013
      aaa: 014
I am using yaml.safe_load() to load those files.
From reading the YAML documentation, section 10.2, I expected to get a warning and that aaa will have the value 9:
It is an error for two equal keys to appear in the same mapping node. In such a case the YAML processor may continue, ignoring the second key: value pair and issuing an appropriate warning.
But I get no warning and the value is 12.
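A minimal reproduction of that behaviour, assuming the sample mapping above is nested under xxx with two-space indentation (the inline `doc` string stands in for one of the real files):

```python
import warnings

import yaml

# Stand-in for one of the files: 'aaa' appears twice in the same mapping.
doc = """\
xxx:
  aaa: 011
  bbb: 012
  ccc: 013
  aaa: 014
"""

# Record anything that would be reported through the warnings module.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    data = yaml.safe_load(doc)

print(data["xxx"]["aaa"])  # the later value (octal 014) overwrites the first
print(len(caught))         # number of warnings recorded during loading
```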
Is this a bug? Is there a way to get PyYAML to select the first value for the key?
I looked at a few libraries for other languages to clean this up before further processing, but those either threw an error or continued with the second value.
There are many files, often with the duplicates nested much deeper. They can have complex structures between the keys, and the duplicated key names are not unique to the mapping they occur in, which is valid. Using awk to fix this is not going to work on these files, and there are too many to fix by hand.
I would say that is a bug in PyYAML. The offending code is here:
    def construct_mapping(self, node, deep=False):
        if not isinstance(node, MappingNode):
            raise ConstructorError(None, None,
                    "expected a mapping node, but found %s" % node.id,
                    node.start_mark)
        mapping = {}
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if not isinstance(key, collections.Hashable):
                raise ConstructorError("while constructing a mapping", node.start_mark,
                        "found unhashable key", key_node.start_mark)
            value = self.construct_object(value_node, deep=deep)
            mapping[key] = value
        return mapping
It is clear that no check is done on whether the key already exists, so a later duplicate simply overwrites the earlier value. You would have to subclass the Constructor to make one that has construct_mapping() with an included check:
    if key in mapping:
        warnings.warn(somewarning)
    else:
        mapping[key] = value
And then create a Loader using that Constructor.
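As a rough sketch of that approach, the following subclasses yaml.SafeLoader directly (it already inherits the safe Constructor) instead of wiring up a separate Constructor and Loader. It is a variation on the check shown above: repeated keys are dropped from the node, with a warning, before the stock construct_mapping() runs, so that merge keys (`<<`) keep working. The names `FirstKeyWinsLoader` and `MERGE_TAG` are made up for this example, and the inline document is assumed to mirror the file from the question.

```python
import warnings
from collections.abc import Hashable

import yaml
from yaml.nodes import MappingNode

# Tag that PyYAML's resolver assigns to the merge key "<<".
MERGE_TAG = "tag:yaml.org,2002:merge"


class FirstKeyWinsLoader(yaml.SafeLoader):
    """SafeLoader variant that keeps the first value of a repeated key."""

    def construct_mapping(self, node, deep=False):
        if isinstance(node, MappingNode):
            seen = set()
            kept = []
            for key_node, value_node in node.value:
                if key_node.tag != MERGE_TAG:
                    key = self.construct_object(key_node, deep=True)
                    # Unhashable keys are left for the stock code to reject.
                    if isinstance(key, Hashable):
                        if key in seen:
                            warnings.warn(
                                "duplicate key %r ignored at line %d"
                                % (key, key_node.start_mark.line + 1)
                            )
                            continue  # drop the later key: value pair
                        seen.add(key)
                kept.append((key_node, value_node))
            node.value = kept
        return super().construct_mapping(node, deep=deep)


doc = "xxx:\n  aaa: 011\n  bbb: 012\n  ccc: 013\n  aaa: 014\n"
data = yaml.load(doc, Loader=FirstKeyWinsLoader)
print(data)  # {'xxx': {'aaa': 9, 'bbb': 10, 'ccc': 11}}
```

Because the override lives on the loader class, it applies to every mapping in the document, including deeply nested ones. Wrapping the load in warnings.catch_warnings(record=True) would let you collect the duplicates per file instead of printing them.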
It might be simpler to use ruamel.yaml (disclaimer: I am the author of that package). It correctly loads this, assuming you disable the DuplicateKeyError, and explicitly set YAML 1.1 as the input format:
    import sys
    from pathlib import Path

    import ruamel.yaml

    yaml_file = Path('xx.yaml')

    yaml = ruamel.yaml.YAML()
    yaml.version = (1, 1)  # parse as YAML 1.1, so 011 is read as octal
    yaml.indent(mapping=3, sequence=2, offset=0)
    yaml.allow_duplicate_keys = True  # do not raise DuplicateKeyError
    data = yaml.load(yaml_file)
    assert data['xxx']['aaa'] == 9  # the first value for the key is kept

    yaml_out = ruamel.yaml.YAML()
    yaml_out.dump(data, sys.stdout)
This gives:
    xxx:
      aaa: 9
      bbb: 10
      ccc: 11
Your octals will be converted to decimals (normally that info is preserved, but not when loading legacy YAML 1.1). PyYAML will always do that.
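For reference, the conversion itself is plain base-8 arithmetic. A small standard-library sketch (the `literals` tuple just repeats the values from the question):

```python
# YAML 1.1 style octal literals (leading 0) from the question.
literals = ("011", "012", "013", "014")

# Interpret each literal as base 8, which is how YAML 1.1 reads a leading 0.
decimals = {literal: int(literal, 8) for literal in literals}

for literal, value in decimals.items():
    print(literal, "->", value)
```

This is why the first aaa is 9 and the duplicate is 12.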