Tags: python, python-3.x, ruamel.yaml, ordered-map

Parsing YAML, get line numbers even in ordered maps


I need to get the line numbers of certain keys of a YAML file.

Please note that this answer does not solve the issue: I already use ruamel.yaml, and the answers there do not work with ordered maps.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from ruamel import yaml

data = yaml.round_trip_load("""
key1: !!omap
  - key2: item2
  - key3: item3
  - key4: !!omap
    - key5: item5
    - key6: item6
""")

print(data)

As a result I get this:

CommentedMap([('key1', CommentedOrderedMap([('key2', 'item2'), ('key3', 'item3'), ('key4', CommentedOrderedMap([('key5', 'item5'), ('key6', 'item6')]))]))])

This does not give me access to the line numbers, except for keys whose value is itself an !!omap:

print(data['key1'].lc.line)  # output: 1
print(data['key1']['key4'].lc.line)  # output: 4

but:

print(data['key1']['key2'].lc.line)  # output: AttributeError: 'str' object has no attribute 'lc'

Indeed, data['key1']['key2'] is a str.

I've found a workaround:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from ruamel import yaml

DATA = yaml.round_trip_load("""
key1: !!omap
  - key2: item2
  - key3: item3
  - key4: !!omap
    - key5: item5
    - key6: item6
""")


def get_line_nb(data):
    if isinstance(data, dict):
        offset = data.lc.line
        for i, key in enumerate(data):
            if isinstance(data[key], dict):
                get_line_nb(data[key])
            else:
                print('{}|{} found in line {}\n'
                      .format(key, data[key], offset + i + 1))


get_line_nb(DATA)

output:

key2|item2 found in line 2

key3|item3 found in line 3

key5|item5 found in line 5

key6|item6 found in line 6

But this looks a bit "dirty". Is there a cleaner way of doing it?

EDIT: this workaround is not only dirty, it also only works for simple cases like the one above, and it will give wrong results as soon as there are nested lists in the way.


Solution

  • The issue is not that you are using !!omap and that it doesn't give you line numbers the way "normal" mappings do. That should be clear from the fact that print(data['key1']['key4'].lc.line) gives you 4 (where key4 is a key in the outer !!omap).

    As this answer indicates,

    you can access the property lc on collection items

    The value for data['key1']['key4'] is a collection item (another !!omap), but the value for data['key1']['key2'] is not a collection item. It is a built-in Python string, which has no slot to store the lc attribute.
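
The difference between a built-in string and a subclass with room for lc can be seen in plain Python, without ruamel.yaml. This is only a minimal sketch: SlottedStr and can_set_lc are hypothetical names used for illustration, not part of the library.

```python
class SlottedStr(str):
    """Hypothetical str subclass whose single slot can hold an `lc` attribute."""
    __slots__ = ('lc',)


def can_set_lc(obj):
    """Return True if an `lc` attribute can be attached to obj."""
    try:
        obj.lc = (2, 10)  # arbitrary (line, column) pair, for the demo only
    except AttributeError:
        return False
    return True


print(can_set_lc('item2'))              # built-in str: nowhere to store lc
print(can_set_lc(SlottedStr('item2')))  # subclass with an lc slot
```

The subclass still compares equal to the plain string, so code that only reads the value does not notice the difference.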

    To get an .lc attribute on a non-collection such as a string, you have to subclass RoundTripConstructor so that it returns something like the classes in scalarstring.py (with __slots__ adjusted to accept the lc attribute), and then transfer the line and column information available in the nodes to that attribute:

    import sys
    import ruamel.yaml
    
    yaml_str = """
    key1: !!omap
      - key2: item2
      - key3: item3
      - key4: !!omap
        - key5: 'item5'
        - key6: |
            item6
    """
    
    # Scalar string types with an extra slot that can hold the line/column info.
    # Note the trailing comma: ('lc',) is a tuple, ('lc') is just the string 'lc'.
    class Str(ruamel.yaml.scalarstring.ScalarString):
        __slots__ = ('lc',)
    
        style = ""
    
        def __new__(cls, value):
            return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value)
    
    class MyPreservedScalarString(ruamel.yaml.scalarstring.PreservedScalarString):
        __slots__ = ('lc',)
    
    class MyDoubleQuotedScalarString(ruamel.yaml.scalarstring.DoubleQuotedScalarString):
        __slots__ = ('lc',)
    
    class MySingleQuotedScalarString(ruamel.yaml.scalarstring.SingleQuotedScalarString):
        __slots__ = ('lc',)
    
    class MyConstructor(ruamel.yaml.constructor.RoundTripConstructor):
        def construct_scalar(self, node):
            # type: (Any) -> Any
            if not isinstance(node, ruamel.yaml.nodes.ScalarNode):
                raise ruamel.yaml.constructor.ConstructorError(
                    None, None,
                    "expected a scalar node, but found %s" % node.id,
                    node.start_mark)
    
            # on Python 3, ruamel.yaml.compat.text_type is simply str
            if node.style == '|' and isinstance(node.value, str):
                ret_val = MyPreservedScalarString(node.value)
            elif bool(self._preserve_quotes) and isinstance(node.value, str):
                if node.style == "'":
                    ret_val = MySingleQuotedScalarString(node.value)
                elif node.style == '"':
                    ret_val = MyDoubleQuotedScalarString(node.value)
                else:
                    ret_val = Str(node.value)
            else:
                ret_val = Str(node.value)
            ret_val.lc = ruamel.yaml.comments.LineCol()
            ret_val.lc.line = node.start_mark.line
            ret_val.lc.col = node.start_mark.column
            return ret_val
    
    
    yaml = ruamel.yaml.YAML()
    yaml.Constructor = MyConstructor
    
    data = yaml.load(yaml_str)
    print(data['key1']['key4'].lc.line)
    print(data['key1']['key2'].lc.line)
    print(data['key1']['key4']['key6'].lc.line)
    

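    Please note that the output of the last call to print is 6 rather than 7 (the line holding item6), because the literal scalar string starts at the | indicator. Line numbers are zero-based, and yaml_str begins with an empty line.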
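
Once the scalars carry .lc, the offset arithmetic from the question's workaround is no longer needed, and nested lists stop being a problem. Below is a minimal sketch of a recursive walk, assuming the data was loaded with something like MyConstructor above. iter_scalar_lines and FakeScalar are hypothetical names; FakeScalar only stands in for loaded scalars so the snippet runs without ruamel.yaml.

```python
from types import SimpleNamespace


def iter_scalar_lines(node, path=()):
    """Yield (path, value, line) for every leaf that carries an `lc` attribute."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_scalar_lines(value, path + (key,))
    elif isinstance(node, (list, tuple)):
        for index, value in enumerate(node):
            yield from iter_scalar_lines(value, path + (index,))
    else:
        lc = getattr(node, 'lc', None)
        if lc is not None:  # leaves without line info are skipped
            yield path, node, lc.line


class FakeScalar(str):
    """Stand-in for a scalar produced by a line-aware constructor (demo only)."""

    def __new__(cls, value, line):
        obj = super().__new__(cls, value)
        obj.lc = SimpleNamespace(line=line)
        return obj


demo = {
    'key1': {
        'key2': FakeScalar('item2', 2),
        'key4': [FakeScalar('a', 5), {'key6': FakeScalar('b', 6)}],
    }
}

for path, value, line in iter_scalar_lines(demo):
    print('{} = {} found in line {}'.format('/'.join(map(str, path)), value, line))
```

With real data you would pass the result of yaml.load(yaml_str) instead of demo. The isinstance(node, dict) check is the same one the question's workaround relies on for the !!omap values.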

    If you also want to dump the data, you'll need to make a Representer aware of Str and the My... types.