How can I parse YAML with TAGs?


I have a YAML document like this:

 steps:
  - !<!entry>
    id: Entry-1
    actions: []
  - !<!replybuttons>
    id: ReplyButtons-langcheck
    footer: ''
  - !<!input>
    id: Input-langcheck
    var: Input-1
  - !<!logic>
    id: LangCheck-Logic
    entries:
      - condition: !<!equals>
          var: Input-langcheck
          isCaseSensitive: false

And I try to read it:

import yaml

yaml.safe_load(yaml_text)

But I get an error:

yaml.constructor.ConstructorError: could not determine a constructor for the tag '!entry'

How can I parse YAML with such tags?

Registering a constructor like this doesn't work either:

def construct_entry(loader, node):
    value = loader.construct_scalar(node)
    return value

yaml.SafeLoader.add_constructor('!<!entry>', construct_entry)
result = yaml.safe_load(yaml_text)

If I use ruamel.yaml I can read the YAML document, but I still don't understand how to find out which tags were used from the loaded Python data.

import sys
from ruamel.yaml import YAML


class Entry:
    yaml_tag = '!<!entry>'

    def __init__(self, value, style=None):
        self.value = value
        self.style = style

    @classmethod
    def to_yaml(cls, representer, node):
        return representer.represent_scalar(cls.yaml_tag,
                                            u'{.value}'.format(node), node.style)

    @classmethod
    def from_yaml(cls, constructor, node):
        return cls(node.value, node.style)


yaml_text = """\
steps:
  - !<!entry>
    id: 1
    action: 2
  - !<!entry>
    id: 2
    action: 3
"""


yaml1 = YAML(typ='rt')

data1 = yaml1.load(yaml_text)

print(f'{data1=}')
yaml1.dump(data1, sys.stdout)

yaml2 = YAML(typ='rt')
yaml2.register_class(Entry)

data2 = yaml2.load(yaml_text)

print(f'{data2=}')
yaml1.dump(data2, sys.stdout)

The effect is exactly the same.

data1=ordereddict([('steps', [ordereddict([('id', 1), ('action', 2)]), ordereddict([('id', 2), ('action', 3)])])])
steps:
- !entry
  id: 1
  action: 2
- !entry
  id: 2
  action: 3
data2=ordereddict([('steps', [ordereddict([('id', 1), ('action', 2)]), ordereddict([('id', 2), ('action', 3)])])])
steps:
- !entry
  id: 1
  action: 2
- !entry
  id: 2
  action: 3

Solution

  • If you just need to inspect the tags, the loaded dict and list subclasses preserve their tag in the .tag attribute (this might change, so pin the version of ruamel.yaml you use):

    import sys
    import ruamel.yaml
    
    yaml_str = """\
    steps:
    - !<!entry>
      id: Entry-1
      actions: []
    - !<!replybuttons>
      id: ReplyButtons-langcheck
      footer: ''
    - !<!input>
      id: Input-langcheck
      var: Input-1
    - !<!logic>
      id: LangCheck-Logic
      entries:
        - condition: !<!equals>
            var: Input-langcheck
            isCaseSensitive: false
    """
        
    yaml = ruamel.yaml.YAML()
    data = yaml.load(yaml_str)
    print('id', data['steps'][1]['id'])
    print('tag', data['steps'][1].tag.value)
    

    which gives:

    id ReplyButtons-langcheck
    tag !replybuttons
    

    Your first attempt failed because your tags are special: the <> makes them verbatim tags, which in this case are necessary to allow a tag that starts with an exclamation mark. So when the YAML contains !<abc> you register !abc with add_constructor (and I think you can leave out the !), and when your YAML contains !<!abc> you register !abc. The parser strips the <> from verbatim tags, which is why the printed tag doesn't contain them after loading.

    Writing this, I noticed that the round-trip parser doesn't check whether a tag needs to be written verbatim. So if you dump the loaded data, you get non-verbatim tags, which don't load the same way. If you need to update these files, you should get the classes registered (let me know if that doesn't work out). Recursively walking over the data structure and rewriting the tags to compensate for this bug will not work, as the <> gets escaped while dumping.
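
To make the point about registering !entry rather than !<!entry> concrete, here is a minimal PyYAML sketch. It is not part of the original answer, and the names TaggedDict, TagLoader and construct_tagged are made up for illustration. It assumes every tagged node in the document is a mapping, which is also why construct_scalar in the question's attempt would fail even with the right tag name. Instead of registering each tag separately, it uses add_multi_constructor with the prefix '!' so that all local tags are handled by one function:

```python
import yaml

yaml_text = """\
steps:
  - !<!entry>
    id: Entry-1
    actions: []
  - !<!replybuttons>
    id: ReplyButtons-langcheck
    footer: ''
  - !<!input>
    id: Input-langcheck
    var: Input-1
  - !<!logic>
    id: LangCheck-Logic
    entries:
      - condition: !<!equals>
          var: Input-langcheck
          isCaseSensitive: false
"""


class TaggedDict(dict):
    """Hypothetical helper: a dict that remembers the tag it was loaded with."""
    tag = None


class TagLoader(yaml.SafeLoader):
    """Subclass, so the constructors of the global SafeLoader stay untouched."""


def construct_tagged(loader, tag_suffix, node):
    # Assumption: tagged nodes are always mappings, as in the question.
    if not isinstance(node, yaml.MappingNode):
        raise yaml.constructor.ConstructorError(
            None, None,
            f"expected a mapping node for tag {node.tag!r}",
            node.start_mark)
    # deep=True builds nested values (including nested tagged nodes) right away.
    data = TaggedDict(loader.construct_mapping(node, deep=True))
    # The loader has already stripped the <>, so node.tag is e.g. '!entry'.
    data.tag = '!' + tag_suffix
    return data


# The prefix '!' matches every tag that starts with an exclamation mark.
TagLoader.add_multi_constructor('!', construct_tagged)

result = yaml.load(yaml_text, Loader=TagLoader)

for step in result['steps']:
    print(step.tag, step['id'])
```

This only covers loading. Dumping such data back with verbatim tags is a separate problem and is not addressed by this sketch.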