python yaml jinja2

Referencing specific property of YAML anchor


I am trying to set up a single, human-readable file of data values that will be used to populate several source files. Right now, I'm using YAML and a short Python script with Jinja2 to insert the YAML values into template source files. It's important that each value keeps the name under which it is used, as there are hundreds of parameters that have complex interrelationships.

I'm trying to reference the same data value in multiple places under different names, and trying to use YAML anchors and aliases to do so.

Basically I want this:

compute:
  gpus: 4
  cpus: 120
  mem: 900Gi

torchrun:
  nproc_per_node: [compute.gpus]
  ...

resource_requests:
  gpus: [compute.gpus]
  ...

I've tried this:

compute: &compute
  gpus: 4

torchrun:
  nproc_per_node: *compute.gpus

...but apparently it isn't valid syntax in PyYAML, and I've seen conflicting things online as to whether it's valid YAML at all.

The end result will be multiple files, for example:

resources.py:

@task(
   gpus = {{ resource_requests.gpus }}
)
def my_task():
  ...

and run.sh:

torchrun --nproc_per_node={{ torchrun.nproc_per_node }} ... 

I am aware that with Jinja2 I could simply reuse compute.gpus in the templates for both files, but I want to keep the names consistent so that it's easier to reason about what the template values are actually doing (again, there are dozens of such values that get used in different places but must be consistent).


Solution

  • The dot is a normal character that can be part of an anchor/alias. You can use all printable non-whitespace characters, except the flow indicators ( ,[]{} ), in an anchor.

    So if you use *compute.gpus in your YAML document, it has to be preceded by a matching anchor &compute.gpus, placed on the scalar value itself rather than on the parent mapping.

    You should always refer to the YAML 1.2 spec as to what is valid YAML, and not rely on what PyYAML can, or cannot, parse, as it only supports a subset of the YAML 1.1 specification (which was superseded by YAML 1.2 in 2009).

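    import ruamel.yaml

    # The anchor name contains a dot and is attached to the scalar 4,
    # not to the `compute` mapping.
    yaml_str = """\
    compute:
      gpus: &compute.gpus 4

    torchrun:
      nproc_per_node: *compute.gpus
    """

    yaml = ruamel.yaml.YAML()
    data = yaml.load(yaml_str)
    # The alias resolves to the anchored value.
    print(data['torchrun']['nproc_per_node'])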

    which prints:

    4