
How to process this YAML without losing any of the non-YAML keywords


I have the following YAML snippet. I want to resolve the alias (the pointer to the anchor), but I don't want to lose !flatten and !ref, which are to be processed by another program.

Input:

_ip_context: &ip_context
  ip_restriction: !flatten
  - !ref 'constant::public_cidr_blocks'

policies:
- file: policies/somepolicy.json
  context:
    <<: *ip_context

Desired output:

policies:
- file: policies/somepolicy.json
  context:
    ip_restriction: !flatten
    - !ref 'constant::public_cidr_blocks'

I tried this program, which was produced by ChatGPT, but it didn't give me what I wanted:

import sys
import yaml

yaml_content = """
_ip_context: &ip_context
  ip_restriction: !flatten
  - !ref 'constant::public_cidr_blocks'

policies:
- file: policies/somepolicy.json
  context:
    <<: *ip_context
"""

class FlattenConstructor(yaml.constructor.SafeConstructor):
    def construct_flatten(self, node):
        return self.construct_sequence(node)

class RefConstructor(yaml.constructor.SafeConstructor):
    def construct_ref(self, node):
        return self.construct_scalar(node)

yaml.add_constructor('!flatten', FlattenConstructor.construct_flatten, Loader=yaml.SafeLoader)
yaml.add_constructor('!ref', RefConstructor.construct_ref, Loader=yaml.SafeLoader)

data = yaml.load(yaml_content, Loader=yaml.SafeLoader)

class FlattenRepresenter(yaml.representer.SafeRepresenter):
    def represent_flatten(self, data):
        return self.represent_sequence('!flatten', data)

class RefRepresenter(yaml.representer.SafeRepresenter):
    def represent_ref(self, data):
        return self.represent_scalar('!ref', data)

yaml.add_representer(list, FlattenRepresenter.represent_flatten)
yaml.add_representer(str, RefRepresenter.represent_ref)

#with open('output.yaml', 'w') as outfile:
yaml.dump(data, sys.stdout, default_flow_style=False,Dumper=yaml.SafeDumper)


This is the output:

_ip_context:
  ip_restriction: &id001
  - constant::public_cidr_blocks
policies:
- context:
    ip_restriction: *id001
  file: policies/somepolicy.json

Solution

  • I asked generative "AI" programs some questions about Python and YAML (about which I imagine I know a thing or two), and had a good laugh at the answers they gave.

    The code doesn't create different types for tagged and non-tagged sequences and scalars, so the output would have had tags attached to all of them if the representer code had worked. The code also fails to do anything to prevent the anchor and aliases from being created.

    Removing aliases without removing the anchor is described here. In your case things are on the one hand simpler, as you can just remove the part of the loaded data structure that you don't want, to get rid of the anchors and aliases.
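    As a rough sketch of the points in the two paragraphs above, staying with PyYAML rather than switching libraries: tagged nodes get their own Python types, so only those are dumped with a tag, and the dumper is told never to emit anchors or aliases. The names Flatten, Ref, TagLoader and NoAliasDumper are made up for this illustration and are not part of PyYAML. It assumes PyYAML 5.1 or later (for sort_keys) and relies on SafeLoader resolving the merge key while loading.

```python
import yaml

class Flatten(list):
    """Marks a sequence that carried the !flatten tag (hypothetical helper type)."""

class Ref(str):
    """Marks a scalar that carried the !ref tag (hypothetical helper type)."""

class TagLoader(yaml.SafeLoader):
    """Private loader subclass, so yaml.SafeLoader itself is left untouched."""

class NoAliasDumper(yaml.SafeDumper):
    def ignore_aliases(self, data):
        # Returning True for everything prevents &id001 / *id001 in the output.
        return True

def construct_flatten(loader, node):
    return Flatten(loader.construct_sequence(node, deep=True))

def construct_ref(loader, node):
    return Ref(loader.construct_scalar(node))

def represent_flatten(dumper, data):
    return dumper.represent_sequence('!flatten', list(data))

def represent_ref(dumper, data):
    # Single-quoted style, to stay close to the quoting used in the input.
    return dumper.represent_scalar('!ref', str(data), style="'")

TagLoader.add_constructor('!flatten', construct_flatten)
TagLoader.add_constructor('!ref', construct_ref)
NoAliasDumper.add_representer(Flatten, represent_flatten)
NoAliasDumper.add_representer(Ref, represent_ref)

yaml_content = """
_ip_context: &ip_context
  ip_restriction: !flatten
  - !ref 'constant::public_cidr_blocks'

policies:
- file: policies/somepolicy.json
  context:
    <<: *ip_context
"""

data = yaml.load(yaml_content, Loader=TagLoader)
del data['_ip_context']  # drop the helper mapping that only held the anchor
result = yaml.dump(data, Dumper=NoAliasDumper,
                   default_flow_style=False, sort_keys=False)
print(result)
```

    Plain lists and strings keep going through the stock representers, so they are dumped without tags. Unlike ruamel.yaml, this approach does not preserve comments or the original formatting.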

    ruamel.yaml will preserve the tags for you without your having to do anything special, but it will also preserve the merge key, which is not what you want. To get rid of that you could update the representer, but that would require duplicating a rather large piece of code from the method represent_mapping. So my preference is to just recursively walk over the data structure and get rid of the merge information (which is equally dependent on ruamel.yaml internals, so pin the version you use):

    import sys
    from pathlib import Path
    import ruamel.yaml
    
    file_name = Path('input.yaml')
    
    def un_merge(d):
        if isinstance(d, dict):
            if d.merge:
                for kvs in d.merge:
                    for k1, v1 in kvs[1].items():  # kvs[0] is the position of the merge
                        d[k1] = v1
                delattr(d, ruamel.yaml.comments.merge_attrib)
            for k, v in d.items():
                un_merge(k)
                un_merge(v)
        elif isinstance(d, list):
            for elem in d:
                un_merge(elem)
    
    
    yaml = ruamel.yaml.YAML()
    yaml.preserve_quotes = True
    data = yaml.load(file_name)
    del data['_ip_context']
    un_merge(data)
    yaml.dump(data, sys.stdout)
    

    which gives:

    policies:
    - file: policies/somepolicy.json
      context:
        ip_restriction: !flatten
        - !ref 'constant::public_cidr_blocks'
    

    and that looks like your desired output.

    By default ruamel.yaml removes superfluous quotes, and the quotes around constant::public_cidr_blocks are not necessary for parsers that correctly handle colons within scalars (not all do). Within tagged scalars, however, quotes are preserved regardless of preserve_quotes. So the yaml.preserve_quotes = True line only matters if you have untagged scalars with superfluous quotes that you want to keep.

    The order of the mapping keys is preserved (it wasn't in the output you got).
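    The reordering in your output comes from PyYAML's dump, which sorts mapping keys by default. A minimal sketch of the difference, assuming PyYAML 5.1 or later (where the sort_keys argument exists); the empty ip_restriction list is just a placeholder:

```python
import yaml

doc = {'policies': [{'file': 'policies/somepolicy.json',
                     'context': {'ip_restriction': []}}]}

# Default behaviour: keys are sorted, so 'context' is written before 'file'.
sorted_out = yaml.safe_dump(doc, default_flow_style=False)

# sort_keys=False keeps the insertion order of the dicts.
ordered_out = yaml.safe_dump(doc, default_flow_style=False, sort_keys=False)

print(sorted_out)
print(ordered_out)
```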

    If there had been comments on the anchor part of the original mapping, these would not have been "moved" automagically.