
With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?


I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).

I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.

In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.

^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.

What should I extend to achieve this? The emitter is where the line-length gets calculated so wrapping can occur, but I suspect that is too late to convert between block- and flow-style. I'm also concerned about losing comments when I switch the styles. Here are some possible extension points, can you give me a pointer on where I'm most likely to have success with this?

  • Emitter.expect_flow_mapping() probably too late for converting flow->block
  • Serializer.serialize_node() probably too late as it consults node.flow_style
  • RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
  • I could also walk the data before calling yaml.dump(), but this has no idea about line length.

So where should I, and where can I, adjust the flow_style depending on whether a flow-style map would trigger line wrapping?


Solution

  • I think the most accurate approach, when you encounter a flow-style mapping in the dumping process, is to first try to emit it to a buffer, then get the length of that buffer, and, if that length combined with the column you are in exceeds the maximum width, actually emit block-style instead.
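    The buffer check described above can be sketched without touching ruamel.yaml internals. This is only an illustration of the decision rule: fits_as_flow and render are hypothetical names, and render stands in for whatever would really emit the flow-style text of one mapping.

```python
import io


def fits_as_flow(emit_flow, column, width):
    """Return True if the flow-style rendering fits on the current line.

    emit_flow is a hypothetical callable that writes the flow-style text
    of one mapping to the stream it receives (a stand-in for a real emitter).
    """
    buf = io.StringIO()
    emit_flow(buf)
    rendered = buf.getvalue()
    # A rendering that already spans several lines was wrapped, so it does not fit.
    if "\n" in rendered:
        return False
    # The mapping starts at `column`, so the line ends at column + len(rendered).
    return column + len(rendered) <= width


def render(stream):
    # Stand-in for the emitter: 12 characters of flow-style output.
    stream.write("{a: 1, b: 2}")


print(fits_as_flow(render, column=7, width=20))  # True: 7 + 12 <= 20
print(fits_as_flow(render, column=7, width=18))  # False: 7 + 12 > 18
```

    When the function returns False, the caller would emit the mapping in block style instead.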

    Any attempt to guesstimate the length of the output without actually emitting that part of the tree is going to be hard, if not impossible. Among other things, the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ([1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]). So a simple calculation based on the length of the scalars that are the keys and values, plus the separating commas, colons and spaces, is not going to be precise.


    A different approach is to dump your data with a large line width, parse the output, and build a set of the line numbers of lines that are too long for the width you actually want to use. After loading that output back, you can walk over the data structure recursively and inspect the .lc attribute to determine the line number on which a flow-style mapping (or sequence) started; if that line number is in the set you built beforehand, change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.

    If you run the following, the initial dumped value for quote will be on one line. The change_to_block method as presented changes every flow-style mapping/sequence that starts on a line that is too long.

    import sys
    import ruamel.yaml
    
    yaml_str = """\
    movie: bladerunner
    quote: {[Batty, Roy]: [ 
             I have seen things you people wouldn't believe.,
             Attack ships on fire off the shoulder of Orion.,
             I watched C-beams glitter in the dark near the Tannhäuser Gate.,
           ]}
    """
        
    
    class Blockify:
        def __init__(self, width, only_first=False, verbose=0):
            self._width = width
            self._yaml = None
            self._only_first = only_first
            self._verbose = verbose
    
        @property
        def yaml(self):
            if self._yaml is None:
                self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
                y.preserve_quotes = True
                y.width = 2**16
            return self._yaml
    
        def __call__(self, d):
            pass_nr = 0
            changed = [True]
            while changed[0]:
                changed[0] = False
                try:
                    s = self.yaml.dumps(d)
                except AttributeError:
                    print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
                    sys.exit(1)
                if self._verbose > 1:
                    print(s)
                too_long = set()
                max_ll = -1
                for line_nr, line in enumerate(s.splitlines()):
                    if len(line) > self._width:
                        too_long.add(line_nr)
                    if len(line) > max_ll:
                        max_ll = len(line)
                if self._verbose > 0:
                    print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
                    sys.stdout.flush()
                new_d = self.yaml.load(s)
                self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
                d = new_d
                pass_nr += 1
            return d, s
    
        @staticmethod
        def change_to_block(d, too_long, changed, only_first):
            if isinstance(d, dict):
                if d.fa.flow_style() and d.lc.line in too_long:
                    d.fa.set_block_style()
                    changed[0] = True
                    return  # don't convert nested flow styles, might not be necessary
                # don't change keys if any value is changed
                for v in d.values():
                    Blockify.change_to_block(v, too_long, changed, only_first)
                    if only_first and changed[0]:
                        return
                if changed[0]:  # don't change keys if value has changed
                    return
                for k in d:
                    Blockify.change_to_block(k, too_long, changed, only_first)
                    if only_first and changed[0]:
                        return
            if isinstance(d, (list, tuple)):
                if d.fa.flow_style() and d.lc.line in too_long:
                    d.fa.set_block_style()
                    changed[0] = True
                    return  # don't convert nested flow styles, might not be necessary
                for elem in d:
                    Blockify.change_to_block(elem, too_long, changed, only_first)
                    if only_first and changed[0]:
                        return
    
    blockify = Blockify(96, verbose=2) # set verbose to 0, to suppress progress output
    
    yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
    data = yaml.load(yaml_str)
    blockified_data, string_output = blockify(data)
    print('-'*32, 'result:', '-'*32)
    print(string_output)  # string_output has no final newline
    

    which gives:

    movie: bladerunner
    quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
    pass: 0, lines: [1], longest: 186
    movie: bladerunner
    quote:
      [Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
    pass: 1, lines: [2], longest: 179
    movie: bladerunner
    quote:
      [Batty, Roy]:
      - I have seen things you people wouldn't believe.
      - Attack ships on fire off the shoulder of Orion.
      - I watched C-beams glitter in the dark near the Tannhäuser Gate.
    pass: 2, lines: [], longest: 67
    -------------------------------- result: --------------------------------
    movie: bladerunner
    quote:
      [Batty, Roy]:
      - I have seen things you people wouldn't believe.
      - Attack ships on fire off the shoulder of Orion.
      - I watched C-beams glitter in the dark near the Tannhäuser Gate.
    

    Please note that when using ruamel.yaml<0.18 the sequence [Batty, Roy] will never be in block style, because the tuple subclass CommentedKeySeq never gets a line number attached.