Flexibility in handling YAML duplicate key entries


I am using YAML files to let users configure a serial workflow for a Python program that I am developing:

step1:
    method1:
        param_x: 44
    method2:
        param_y: 14
        param_t: string   
    method1:
        param_x: 22
step2:
    method2:
        param_z: 7
    method1:
        param_x: 44
step3:
    method3:
        param_a: string

This is then parsed in Python and stored as a dictionary. Now, I know duplicate keys in YAML and Python dictionaries are not allowed (why, btw?), but YAML seems perfect for my case given its clarity and minimalism.

I tried to follow the approach suggested in this question (Getting duplicate keys in YAML using Python). However, in my case keys are sometimes duplicated and sometimes not, and with the proposed construct_yaml_map this creates either a dict or a list, which is not what I want. Depending on the node depth, I would like to be able to send keys and values on the second level (method1, method2, ...) to a list within a Python dictionary, to avoid the duplication issue.


Solution

  • While parsing, ruamel.yaml has no concept of depth beyond being at the root level of a document (among other things, in order to allow root-level literal scalars to be unindented). Adding such a notion of depth is going to be difficult, since you have to deal with aliases and possibly recursive occurrences of data. I am also not sure what this would mean in general (although it is clear enough for your example).

    The method that creates a mapping in the default, round-trip, loader of ruamel.yaml is rather long. But if you are going to jumble mapping values together, you should not expect to be able to dump them back, let alone preserve comments, aliases, etc. The following assumes you'll be using the simpler safe loader, and that your input may have aliases and/or merge keys.

    import sys
    import ruamel.yaml
    
    yaml_str = """\
    step1:
        method1:
            param_x: 44
        method2:
            param_y: 14
            param_t: string   
        method1:
            param_x: 22
    step2:
        method2:
            param_z: 7
        method1:
            param_x: 44
    step3:
        method3:
            param_a: string
    """
    
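    from ruamel.yaml.nodes import MappingNode
    # ConstructorError lives in the constructor module, not in ruamel.yaml.nodes
    from ruamel.yaml.constructor import ConstructorError
    try:
        from ruamel.yaml.compat import Hashable, PY2
    except ImportError:  # fallback in case a release no longer exports these names
        from collections.abc import Hashable
        PY2 = False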
    
    
    class MyConstructor(ruamel.yaml.constructor.SafeConstructor):
        def construct_mapping(self, node, deep=False):
            if not isinstance(node, MappingNode):
                raise ConstructorError(
                    None, None, 'expected a mapping node, but found %s' % node.id, node.start_mark
                )
            total_mapping = self.yaml_base_dict_type()
            if getattr(node, 'merge', None) is not None:
                todo = [(node.merge, False), (node.value, False)]
            else:
                todo = [(node.value, True)]
            for values, check in todo:
                mapping = self.yaml_base_dict_type()  # type: Dict[Any, Any]
                for key_node, value_node in values:
                    # keys can be list -> deep
                    key = self.construct_object(key_node, deep=True)
                    # lists are not hashable, but tuples are
                    if not isinstance(key, Hashable):
                        if isinstance(key, list):
                            key = tuple(key)
                    if PY2:
                        try:
                            hash(key)
                        except TypeError as exc:
                            raise ConstructorError(
                                'while constructing a mapping',
                                node.start_mark,
                                'found unacceptable key (%s)' % exc,
                                key_node.start_mark,
                            )
                    else:
                        if not isinstance(key, Hashable):
                            raise ConstructorError(
                                'while constructing a mapping',
                                node.start_mark,
                                'found unhashable key',
                                key_node.start_mark,
                            )
                    value = self.construct_object(value_node, deep=deep)
                    if key in mapping:
                        if not isinstance(mapping[key], list):
                            mapping[key] = [mapping[key]]
                        mapping[key].append(value)
                    else:
                        mapping[key] = value
                total_mapping.update(mapping)
            return total_mapping
    
    
    yaml = ruamel.yaml.YAML(typ='safe')
    yaml.Constructor = MyConstructor
    data = yaml.load(yaml_str)
    for k1 in data:
        # might need to guard this with a try-except for non-dictionary first-level values
        for k2 in data[k1]:
            if not isinstance(data[k1][k2], list):  # make every second-level value a list
                data[k1][k2] = [data[k1][k2]]
    print(data['step1'])
    

    which gives:

    {'method1': [{'param_x': 44}, {'param_x': 22}], 'method2': [{'param_y': 14, 'param_t': 'string'}]}
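The post-processing loop above notes that non-dictionary first-level values might need a guard. A minimal sketch of that step as a standalone helper follows. It uses an isinstance check rather than try-except. The function name listify_second_level, the sample dictionary, and the step4 entry are hypothetical and only illustrate the shape of the loaded data:

```python
def listify_second_level(data):
    """Wrap every second-level value in a list, in place, and return data.

    First-level values that are not dictionaries (scalars, lists, None)
    are left untouched instead of raising.
    """
    for step, methods in data.items():
        if not isinstance(methods, dict):
            continue  # nothing to normalize below this key
        for name, params in methods.items():
            if not isinstance(params, list):
                # replacing the value of an existing key is safe while iterating
                methods[name] = [params]
    return data


# Assumed shape of the data before normalizing: duplicated keys have already
# been collected into a list, unique keys still map to a plain dict.
loaded = {
    'step1': {
        'method1': [{'param_x': 44}, {'param_x': 22}],
        'method2': {'param_y': 14, 'param_t': 'string'},
    },
    'step3': {'method3': {'param_a': 'string'}},
    'step4': 3,  # hypothetical non-dictionary first-level value
}

result = listify_second_level(loaded)
print(result['step1']['method2'])
print(result['step4'])
```

Two caveats apply to this scheme. A method whose YAML value is itself a sequence cannot be told apart from collected duplicates, because both end up as a Python list. And because duplicates are grouped under one key, the relative order of method1, method2, method1 within a step is not kept, as the output for step1 shows.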