How to replace many identical values in a YAML file


I am building a Python application that uses YAML configs. I generate the config file from other YAML files: a "template" YAML defines the basic structure I want in the file the app uses, and many different "data" YAMLs fill in the template to steer the application's behavior a certain way. For example, say I have 10 "data" YAMLs. Depending on where the app is being deployed, one "data" YAML is chosen and used to fill out the "template" YAML. The resulting filled-out YAML is what the application uses to run. This saves me a ton of work, but I have run into a problem with this method. Say I have a template YAML that looks like this:

id: {{id}}
endpoints:
  url1: https://website.com/{{id}}/search
  url2: https://website.com/foo/{{id}}/get_thing
  url3: https://website.com/hello/world/{{id}}/trigger_stuff
foo:
  bar:
    deeply:
      nested: {{id}}

Then, somewhere else, I have about 10 "data" YAMLs, each with a different value for {{id}}. I can't figure out an efficient way to replace all these {{id}} occurrences in the template. The problem is that sometimes the value to be substituted is a substring of a value I mostly want to keep, and sometimes the occurrences are very far apart in the hierarchy, which makes looping solutions inefficient. My current method for generating the config file from template + data looks something like this in Python:

import yaml
import os

template_yaml = os.path.abspath(os.path.join(os.path.dirname(__file__), 'template.yaml'))
# In this same folder you would find flavor2, flavor3, flavor4, etc, lets just use 1 for now
data_yaml = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data_files', 'flavor1.yaml'))
# This is where we dump the filled out template the app will actually use
output_directory = os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))

with open(template_yaml, 'r') as template:
    loaded_template = yaml.safe_load(template)  # Load the template as a dict
with open(data_yaml, 'r') as data:
    loaded_data = yaml.safe_load(data)  # Load the data as a dict
# From this point on I am basically just setting individual keys from "loaded_template" to values in "loaded_data"
# But 1 at a time, which is what I am trying to avoid:
loaded_template['id'] = loaded_data['id']
loaded_template['endpoints']['url1'] = loaded_template['endpoints']['url1'].replace('{{id}}', str(loaded_data['id']))
loaded_template['foo']['bar']['deeply']['nested'] = loaded_data['id']

Any idea on how to go through and change all the {{id}} occurrences faster?


Solution

  • You are proposing to use PyYAML, but it is not well suited for updating YAML files. In that process, if it can load your file in the first place, you lose your mapping key order and any comments you have in the file, merges get expanded, and any special anchor names get lost in translation. Apart from that, PyYAML cannot deal with the YAML 1.2 spec (released in 2009), and it can only handle simple mapping keys.

    There are two main solutions:

    • You can use substitution on the raw file
    • You can use ruamel.yaml and recurse into the data structure

    Substitution

    If you use substitution, you can do it in a much more efficient way than the line-by-line substitution that @caseWestern proposes. But most of all, you should harden the scalars in which these substitutions take place. Currently you have plain scalars (i.e. flow style scalars without quotes), and those tend to break if you insert things like #, : and other syntactically significant elements.

    To prevent that from happening, rewrite your input file to use block style literal scalars:

    id: {{id}}
    endpoints:
      url1: |-
        https://website.com/{{id}}/search
      url2: |-
        https://website.com/foo/{{id}}/get_thing
      url3: |-
        https://website.com/hello/world/{{id}}/trigger_stuff
    foo:
      bar:
        deeply:
          nested: |-
            {{id}}
    

    If the above is in alt.yaml you can do:

    val = 'xyz'
    
    with open('alt.yaml') as ifp:
        with open('new.yaml', 'w') as ofp:
            ofp.write(ifp.read().replace('{{id}}', val))
    

    to get:

    id: xyz
    endpoints:
      url1: |-
        https://website.com/xyz/search
      url2: |-
        https://website.com/foo/xyz/get_thing
      url3: |-
        https://website.com/hello/world/xyz/trigger_stuff
    foo:
      bar:
        deeply:
          nested: |-
            xyz
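
The single replace() call above handles one placeholder. When a "data" YAML supplies several values, the same raw-text approach can be extended to one regex pass over the template. The following is only a sketch under assumptions: the names PLACEHOLDER and render_template, the {{region}} placeholder, and the rule that a placeholder name consists of word characters are all made up for illustration. It refuses values containing a newline, because a line break would leave the following text outside the indentation of the block literal scalar.

```python
import re

# Matches {{name}} (optionally with inner spaces). The allowed name
# characters are an assumption made for this sketch.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(text, values):
    """Replace every {{name}} in text with str(values[name]) in one pass."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly instead of leaving a placeholder in the output.
            raise KeyError(f"no value supplied for placeholder {name!r}")
        value = str(values[name])
        if "\n" in value:
            # A newline would break the indentation of a block literal scalar.
            raise ValueError(f"value for {name!r} must be a single line")
        return value

    # Using a function as the replacement avoids backslash-escape
    # processing of the substituted values.
    return PLACEHOLDER.sub(substitute, text)


template = (
    "id: {{id}}\n"
    "endpoints:\n"
    "  url1: |-\n"
    "    https://website.com/{{id}}/search\n"
    "  url2: |-\n"
    "    https://{{region}}.website.com/foo/{{id}}/get_thing\n"
)

rendered = render_template(template, {"id": "xyz", "region": "eu"})
print(rendered)
```

In the setup from the question, the values dict would come from the loaded "data" YAML, and rendered would be written to the output file.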
    

    ruamel.yaml

    Using ruamel.yaml (disclaimer: I am the author of that package), you don't have to worry about breaking the input with syntactically significant replacement texts. If the replacement contains such text, the output will automatically be correctly quoted. You do have to take care that your input is valid YAML: something like {{ at the beginning of a node indicates two nested flow-style mappings, and using it will get you into trouble.

    The big advantage here is that your input file is loaded, and it is checked to be correct YAML. But this is significantly slower than file level substitution.

    So if your input is in.yaml:

    id: <<id>>  # has to be unique
    endpoints: &EP
      url1: https://website.com/<<id>>/search
      url2: https://website.com/foo/<<id>>/get_thing
      url3: https://website.com/hello/world/<<id>>/trigger_stuff
    foo:
      bar:
        deeply:
          nested: <<id>>
        endpoints: *EP
        [octal, hex]: 0o123, 0x1F
    

    You can do:

    import sys
    import ruamel.yaml
    
    def recurse(d, pat, rep):
        if isinstance(d, dict):
            for k in d:
                if isinstance(d[k], str):
                    d[k] = d[k].replace(pat, rep)
                else:
                    recurse(d[k], pat, rep)
        elif isinstance(d, list):
            for idx, elem in enumerate(d):
                if isinstance(elem, str):
                    d[idx] = elem.replace(pat, rep)
                else:
                    recurse(elem, pat, rep)
    
    
    yaml = ruamel.yaml.YAML()
    yaml.preserve_quotes = True
    with open('in.yaml') as fp:
        data = yaml.load(fp)
    recurse(data, '<<id>>', 'xy: z')  # not that this makes much sense, but it proves a point
    yaml.dump(data, sys.stdout)
    

    which gives:

    id: 'xy: z' # has to be unique
    endpoints: &EP
      url1: 'https://website.com/xy: z/search'
      url2: 'https://website.com/foo/xy: z/get_thing'
      url3: 'https://website.com/hello/world/xy: z/trigger_stuff'
    foo:
      bar:
        deeply:
          nested: 'xy: z'
        endpoints: *EP
        [octal, hex]: 0o123, 0x1F
    

    Please note:

    • The values that contain the replacement pattern are automatically quoted on dump, to deal with the : + space that would otherwise indicate a mapping and break the YAML.

    • The YAML.load() method, contrary to PyYAML's load function, is safe (i.e. it cannot be made to execute arbitrary Python by manipulating the input file).

    • The comment, the octal and hexadecimal integers, and the alias name are preserved.

    • PyYAML cannot load the file in.yaml at all, although it is valid YAML.

    • The recurse function above only changes the values of the input mappings. If you want to change the keys as well, you either have to pop and reinsert all the keys (even those that did not change) to keep the original order, or you need to use enumerate and d.insert(position, key, value). If you have merges, you also cannot just walk over the keys; you'll have to walk over the non-merged keys of the "dict".
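
The pop-and-reinsert option from the last note can be sketched on plain dicts and lists as follows. This is a minimal illustration, not code from ruamel.yaml: the name recurse_with_keys is hypothetical, the sketch has not been checked against ruamel.yaml's round-trip types (where comments attached to keys and merge keys may need extra care), and it does not handle a renamed key colliding with an existing one.

```python
def recurse_with_keys(d, pat, rep):
    """Replace pat with rep in string keys and values of nested dicts/lists, in place."""
    if isinstance(d, dict):
        # Snapshot the keys first: the dict is modified while we walk it.
        for key in list(d):
            value = d.pop(key)
            if isinstance(value, str):
                value = value.replace(pat, rep)
            else:
                recurse_with_keys(value, pat, rep)
            new_key = key.replace(pat, rep) if isinstance(key, str) else key
            # Every key is popped and reinserted in turn, even unchanged
            # ones, so the original key order is kept.
            d[new_key] = value
    elif isinstance(d, list):
        for idx, elem in enumerate(d):
            if isinstance(elem, str):
                d[idx] = elem.replace(pat, rep)
            else:
                recurse_with_keys(elem, pat, rep)


config = {
    "<<id>>_settings": {"name": "<<id>>", "port": 8080},
    "hosts": ["<<id>>.example.com", {"alias_<<id>>": "<<id>>"}],
}
recurse_with_keys(config, "<<id>>", "xyz")
print(config)
```

Reinserting only the changed keys would move them to the end of the mapping, which is why the unchanged keys are popped and reinserted as well.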