Control fold position while using representer in PyYAML


I am able to dump YAML code with long strings in folded form with this code:

import yaml

class folded_str(str): pass

def folded_str_representer(dumper, data):
    return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='>')

yaml.add_representer(folded_str, folded_str_representer)

data = {
    'foo': folded_str(('abcdefghi ' * 10) + 'end\n'),
}

print(yaml.dump(data))

The output for the above code is:

foo: >
  abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi
  abcdefghi abcdefghi end

Is it possible to control the length after which the folds should occur? For example, if I want the lines to fold after 70 characters, then the output would look like this:

foo: >
  abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi
  abcdefghi abcdefghi abcdefghi end

Is there a way to make PyYAML do this?


Solution

  • The easy way to control the length of the folded lines that PyYAML emits is to set the (global) line length with the width parameter:

    import sys
    import yaml
    
    class folded_str(str): pass
    
    def folded_str_representer(dumper, data):
        return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='>')
    
    yaml.add_representer(folded_str, folded_str_representer)
    
    data = {
        'foo': folded_str(('abcdefghi ' * 10) + 'end\n'),
    }
    
    yaml.dump(data, sys.stdout, width=70)
    

    which gives:

    foo: >
      abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi
      abcdefghi abcdefghi abcdefghi end
    

    As you can see, I removed your call to print. PyYAML has a streaming interface; if you don't stream directly to the output, it first has to build the complete output in memory, which is both unnecessarily slow and memory-inefficient.

    Of course this also affects any other lines that get dumped (long non-folded scalars, flow-style lists, deeply nested data structures).
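    To illustrate that width is global, here is a minimal sketch that dumps the same plain (non-folded) string with the default width and with width=40. The names data, wide and narrow are only illustrative, and the dumps return strings here purely so the two results can be compared:

```python
import yaml

# Illustrative data: a plain string that fits on one line at the default width.
data = {'note': ' '.join(['word'] * 12)}

wide = yaml.dump(data)              # default width
narrow = yaml.dump(data, width=40)  # same data, narrower global width

print(wide)
print(narrow)
```

    The narrow dump should spread the plain scalar over more than one line even though it was never marked as folded, and it still loads back to the same data.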

    The non-easy way is not to call the represent_scalar routine, but to adapt PyYAML's ScalarNode (or create your own Node type) so that it outputs a newline at the appropriate position when emitting.


    My ruamel.yaml has this functionality built in, to allow such output to round-trip with the fold position preserved (even though the default output width is the same as PyYAML's):

    import sys
    import ruamel.yaml
    
    yaml_str = """\
    [long, scalar]: "This is just a filler to show that the default width is 80 chars"
    foo: >
      abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi
      abcdefghi abcdefghi abcdefghi end
    """
    
    yaml = ruamel.yaml.YAML()
    data = yaml.load(yaml_str)
    yaml.dump(data, sys.stdout)
    

    which gives:

    [long, scalar]: This is just a filler to show that the default width is 80 chars
    foo: >
      abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi
      abcdefghi abcdefghi abcdefghi end
    

    Although you can create such a folded string from scratch, it is not trivial (there is no API, and the internal representation might change). What I recommend is just creating the folded string as YAML text and then loading it, by defining your folded_str differently:

    import sys
    import ruamel.yaml
    
    yaml = ruamel.yaml.YAML()
    
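    def folded_str(s, pos=70):
        # Greedily pack space-separated words into lines of at most pos characters.
        parts = []
        r = ""  # the line currently being built
        for part in s.split(' '):
            if not r:
                r = part
            elif len(r) + len(part) >= pos:
                # A space plus this word would exceed pos: start a new line.
                parts.append(r + '\n')
                r = part
            else:
                r += ' ' + part
        parts.append(r)
        # Load the text as a folded block scalar so the fold positions are kept.
        return yaml.load(">\n" + "".join(parts))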
    
    data = {
        'foo': folded_str(('abcdefghi ' * 10) + 'end\n'),
    }
    
    yaml.dump(data, sys.stdout)
    

    which gives:

    foo: >
      abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi abcdefghi
      abcdefghi abcdefghi abcdefghi end
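    The word-packing loop in folded_str can also be sketched with the standard library's textwrap module. This is a swapped-in alternative, not what the code above does internally: fold_lines and as_folded_block are hypothetical helper names, and textwrap.wrap normalizes whitespace (newlines become spaces, trailing whitespace is dropped), so the result can differ from the hand-written loop for input with repeated spaces or embedded newlines.

```python
import textwrap

def fold_lines(text, width=70):
    """Greedily split text at whitespace so that no line is longer than width."""
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,  # never split inside a word
        break_on_hyphens=False,  # only break at whitespace
    )

def as_folded_block(text, width=70):
    """Build the source text of a YAML folded block scalar ('>') from text."""
    return ">\n" + "\n".join(fold_lines(text, width)) + "\n"

block = as_folded_block(('abcdefghi ' * 10) + 'end\n')
print(block)
```

    The resulting text has the same shape as the string passed to yaml.load in folded_str above, so under these assumptions it could be loaded the same way.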