python json yaml snakemake

Snakemake: validate nested yaml based on jsonschema


I am using Snakemake's validate function, which appears to be built on jsonschema. It works fine for simple examples, but I am unsure how to proceed with more complex parameter settings.

Let's say I implemented the option for multiple peak callers (e.g. macs2 and genrich). Currently my config.yaml looks like this:

peak_caller:
  - macs2:
      --shift -100 --extsize 200 
  - genrich:
      -y -j

If no peak caller is specified, I would like it to default to macs2 with these parameters. If anything other than one or both of these two peak callers is specified, validation should fail.

I tried various approaches with enums and arrays, but I could never get it to work properly:

$schema: "http://json-schema.org/draft-06/schema#"

description: snakemake-workflows peak calling configuration

properties:
  # peak caller algorithms
  peak_caller:
    description: which peak caller(s) to use. Currently macs2 (default) and genrich are supported.
    type: array
    default: [macs2]

Preferably I would stay with YAML, but I am open to configs written in JSON.


Solution

    properties:
      peak_caller:
        type: array
        items:
          anyOf:
            - type: object
              properties:
                macs2: {type: string}
              required: [macs2]
              additionalProperties: false
            - type: object
              properties:
                genrich: {type: string}
              required: [genrich]
              additionalProperties: false
        maxItems: 2
        uniqueItems: true
        default:
          - macs2: --shift -100 --extsize 200
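
To see how the array schema behaves, here is a minimal sketch that checks a few configs against it. It assumes the jsonschema Python package is installed and writes the schema as a Python dict instead of loading the YAML file. The helper names (caller_item, is_valid) and the caller name some_other_caller are made up for illustration.

```python
from jsonschema import Draft6Validator


def caller_item(name):
    """Schema for a one-key mapping such as {"macs2": "<flags>"}."""
    return {
        "type": "object",
        "properties": {name: {"type": "string"}},
        "required": [name],
        "additionalProperties": False,
    }


# Same schema as above, written as a Python dict.
SCHEMA = {
    "$schema": "http://json-schema.org/draft-06/schema#",
    "properties": {
        "peak_caller": {
            "type": "array",
            "items": {"anyOf": [caller_item("macs2"), caller_item("genrich")]},
            "maxItems": 2,
            "uniqueItems": True,
            "default": [{"macs2": "--shift -100 --extsize 200"}],
        }
    },
}

validator = Draft6Validator(SCHEMA)


def is_valid(config):
    """Return True if the config passes the schema."""
    return validator.is_valid(config)


both = {"peak_caller": [{"macs2": "--shift -100 --extsize 200"}, {"genrich": "-y -j"}]}
unknown = {"peak_caller": [{"some_other_caller": "--flag"}]}  # hypothetical caller
repeated = {"peak_caller": [{"macs2": "--shift -100"}, {"macs2": "--extsize 200"}]}

print(is_valid(both))      # True
print(is_valid(unknown))   # False: matches neither anyOf branch
print(is_valid(repeated))  # True: the two items differ, so uniqueItems passes
```

Plain jsonschema validation only checks the config; it does not write the default into it.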
    

    Note, however, that this schema does not forbid listing macs2 or genrich twice with different parameters, because uniqueItems only rejects items that are exactly identical. As far as I know, that cannot be forbidden with the structure you're currently using. However, if the order of the items is not important, you could simply drop the array and use an object like this:

    peak_caller:
      macs2:
        --shift -100 --extsize 200 
      genrich:
        -y -j
    

    Corresponding schema:

    properties:
      peak_caller:
        type: object
        properties:
          macs2: {type: string}
          genrich: {type: string}
        minProperties: 1   # if you want to have at least one
        additionalProperties: false
        default:
          macs2: --shift -100 --extsize 200
    

    JSON Schema does not require a property to be present unless it is listed under required, so this schema accepts a config that defines only one of the two callers.
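
The object-based schema can be tried out the same way. The sketch below again assumes the jsonschema package and a dict version of the schema. Because plain jsonschema does not fill in defaults, it extends the validator along the lines of the default-setting recipe in the jsonschema FAQ; the names extend_with_default, DefaultingValidator, check and some_other_caller are illustrative. Snakemake's own validate is documented to take a set_default argument for this purpose, so inside a workflow this extra step should not be needed.

```python
import copy

from jsonschema import Draft6Validator, validators

# Object-based schema from above, written as a Python dict.
SCHEMA = {
    "$schema": "http://json-schema.org/draft-06/schema#",
    "properties": {
        "peak_caller": {
            "type": "object",
            "properties": {
                "macs2": {"type": "string"},
                "genrich": {"type": "string"},
            },
            "minProperties": 1,
            "additionalProperties": False,
            "default": {"macs2": "--shift -100 --extsize 200"},
        }
    },
}


def extend_with_default(validator_class):
    """Build a validator class that also fills in missing defaults."""
    validate_properties = validator_class.VALIDATORS["properties"]

    def set_defaults(validator, properties, instance, schema):
        if isinstance(instance, dict):
            for name, subschema in properties.items():
                if "default" in subschema:
                    # Copy so the schema's default is never shared or mutated.
                    instance.setdefault(name, copy.deepcopy(subschema["default"]))
        yield from validate_properties(validator, properties, instance, schema)

    return validators.extend(validator_class, {"properties": set_defaults})


DefaultingValidator = extend_with_default(Draft6Validator)


def check(config):
    """Fill defaults in place, then report whether the config is valid."""
    return DefaultingValidator(SCHEMA).is_valid(config)


empty = {}
only_genrich = {"peak_caller": {"genrich": "-y -j"}}
nothing_selected = {"peak_caller": {}}
unknown = {"peak_caller": {"some_other_caller": "--flag"}}  # hypothetical caller

print(check(empty), empty)      # True {'peak_caller': {'macs2': '--shift -100 --extsize 200'}}
print(check(only_genrich))      # True
print(check(nothing_selected))  # False: minProperties requires at least one caller
print(check(unknown))           # False: additionalProperties is false
```

Since a mapping cannot hold the same key twice, this layout also rules out the repeated-caller case that the array version lets through.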