python yaml airflow job-scheduling luigi

Job Scheduler - YAML for writing job definition?

In our legacy job scheduling software (built on top of crontab), job definitions are written in the Apache config format and parsed with Perl's Config::General. The software is highly customized and has features such as running a command in a job only after checking that its dependency is met, rescheduling a job when a command fails, and supporting custom notifications.

We are now planning to rewrite this software in Python and are considering options such as YAML instead of Apache config for writing job definitions. Is YAML a good fit for such dynamic configurations?

Example of a job definition (run this job at 2 AM daily, check whether it is a Tuesday and not a holiday in India, and if so reserve my flight and send a notification):

// python function to check if it is tuesday
checkIfTuesdayAndNotHoliday()

<job> 
    calendar: indian

        <dependency: arbitrary_python_code: checkIfTuesdayAndNotHoliday()>
        <command>  
            check availability of flight
        </command>

        <success: notify: email: agrawall/>
        <failure: notify: email: ops>
        <command>
            some command to book my flight
        </command>
</job>

<crontab> 0 2 * * * </crontab>

I am struggling to decide which format to use for defining jobs (YAML, Apache config, XML, JSON, etc.). Note that the job definition will be converted to a job object inside my Python script.

Apache config parser in Perl that we currently use: https://metacpan.org/source/TLINDEN/Config-General-2.63/General.pm#L769

Apache config parser in Python that we plan to use: https://github.com/etingof/apacheconfig

Solution

Python-based config files have been around at least since distutils' setup.py in Python 1.6 (i.e. since around 2000). The main disadvantage of such a format is that it is difficult to update values in the config programmatically. Even if you only want to write an additional utility that analyzes these files, you have to take special care that you can import such a config file without executing code and without pulling in all kinds of dependencies via imports. This can be achieved by using if __name__ == '__main__': or, more easily, by having only the config information as a data structure in a file.

So if updating the files programmatically is never going to be an issue, you can use Python-based data structures, which are quite readable.
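As a rough sketch of what such a Python-based definition could look like for the job in the question: the names `JOB_DEFINITION`, `Job`, `load_job` and `check_if_tuesday` are made up for illustration, and the holiday-calendar check is left out. The definition is plain data plus a reference to a function, so importing the file runs nothing.

```python
import datetime
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def check_if_tuesday(today=None):
    """Dependency check: True when the given (or current) date is a Tuesday."""
    today = today or datetime.date.today()
    return today.weekday() == 1  # Monday is 0, so Tuesday is 1


# Hypothetical job definition: only data and a function reference,
# so importing this module does not execute any command.
JOB_DEFINITION = {
    "crontab": "0 2 * * *",
    "calendar": "indian",
    "dependency": check_if_tuesday,
    "commands": [
        "check availability of flight",
        "some command to book my flight",
    ],
    "notify": {"success": "agrawall", "failure": "ops"},
}


@dataclass
class Job:
    """Job object built from a definition dict (illustrative only)."""
    crontab: str
    calendar: str
    dependency: Callable[..., bool]
    commands: List[str]
    notify: Dict[str, str] = field(default_factory=dict)

    def is_due(self, today=None):
        """Ask the dependency whether the commands should run."""
        return self.dependency(today)


def load_job(definition):
    """Convert a definition dict into a Job object."""
    return Job(**definition)


if __name__ == "__main__":
    # Only runs when the file is executed directly, not when it is imported.
    job = load_job(JOB_DEFINITION)
    print(job.crontab, job.is_due())
```

The drawback mentioned above still applies: a tool that wants to change `"calendar"` has to rewrite Python source text rather than dump a data file.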

XML and JSON are not good formats for editing by hand. XML has too many < and > to type easily without special tools. JSON has so many double quotes that it becomes difficult to read, and it causes further problems because it does not allow trailing commas in arrays and objects, leading people to write objects like:

{ 
    "a": 1
  , "b": 2
}

With this style you can delete the last line without leaving behind a dangling comma after the previous key/value pair, but IMO it is not what I would call readable.
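The trailing-comma restriction is easy to check with the json module from the Python standard library. The helper name `is_valid_json` is only for this illustration:

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# A comma left behind after the last key/value pair is rejected ...
trailing_comma = '{"a": 1, "b": 2,}'
# ... while the leading-comma layout shown above is valid JSON.
leading_comma = '{\n    "a": 1\n  , "b": 2\n}'

print(is_valid_json(trailing_comma))  # False
print(is_valid_json(leading_comma))   # True
```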

YAML, on the other hand, can be written in a very readable way, but there are some rules that have to be taken into account when editing the files. In my answer here I show some basic rules that can be included in a YAML file and that editors need to take into account when editing. YAML can also be read from languages other than Python (which is difficult to do with Python-based config files).

You can use YAML tags (and appropriate Python objects associated with these tags), so that you do not depend on interpreting the key of a key-value pair to understand what the value represents:

- !Job
  calendar: !Calendar indian
  dependency: !Arbitrary_python_code checkIfTuesdayAndNotHoliday()
  command: !CommandTester
    exec: !Exec check availability of flight
    success: !Commands
      - !Notify
        email: agrawall
      - !Exec some command to book my flight
    failure: !Commands
      - !Notify
        email: ops

(at the bottom is a partial example implementation of the classes associated with these tags)

YAML can also be updated programmatically without losing comments, key ordering, or tags when you use ruamel.yaml (disclaimer: I am the author of that package).

I have been parameterizing my Python packaging (I manage over 100 packages, some of which are on PyPI, others only for specific clients) for quite some time by reading the configuration parameters for my generic setup.py from each package's __init__.py file. I experimented with inserting a JSON subset of Python, but eventually developed PON (Python Object Notation), which setup.py can easily parse without importing the __init__.py file, using a small (100 line) extension of the literal_eval function from the ast module in the Python standard library.

PON can be used without any library, because it is a subset of the Python data structures, including dict, list, set, tuple and basic types like integers, floats, booleans, strings, date and datetime. Because it is based on the AST evaluator, you can do calculations (secs_per_day = 24 * 60 * 60) and other evaluations in your configuration file.
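To give an idea of the approach, here is a minimal sketch using only the standard-library ast module. It is not PON's actual implementation: the names `eval_node`, `load_assignment` and `_job_config` are invented for this example. It parses the source of a file, finds one assignment, and evaluates its value with a restricted evaluator that also accepts simple arithmetic (the plain `ast.literal_eval` rejects an expression such as `2 * 60 * 60`).

```python
import ast
import operator

# Arithmetic operators accepted by this sketch; everything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_node(node):
    """Evaluate literals, containers and simple arithmetic; refuse the rest."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Dict):
        return {eval_node(k): eval_node(v)
                for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.List):
        return [eval_node(e) for e in node.elts]
    if isinstance(node, ast.Tuple):
        return tuple(eval_node(e) for e in node.elts)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](eval_node(node.left),
                                       eval_node(node.right))
    raise ValueError("unsupported expression: %r" % (node,))


def load_assignment(source, name):
    """Return the value assigned to `name` at module level, without
    importing or executing the source."""
    for stmt in ast.parse(source).body:
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and stmt.targets[0].id == name):
            return eval_node(stmt.value)
    raise KeyError(name)


SOURCE = '''
import some_heavy_dependency  # only parsed, never imported

_job_config = {
    "crontab": "0 2 * * *",
    "calendar": "indian",
    "timeout_secs": 2 * 60 * 60,
    "notify": {"success": ["agrawall"], "failure": ["ops"]},
}
'''

config = load_assignment(SOURCE, "_job_config")
print(config["timeout_secs"])  # 7200
```

Because the file is only parsed, the import at its top is never executed, which is the property that matters when a generic tool reads configuration out of a package's source.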

The PON readme also has a more detailed description of the advantages (and disadvantages) of that format compared to YAML, JSON, INI and XML.

The PON package is not needed to use the configuration data, it is only needed if you want to do programmatic round-trips (load-edit-dump) on the PON data.

import subprocess
import sys

from ruamel.yaml import YAML, yaml_object

yaml = YAML()

@yaml_object(yaml)
class CommandTester:
    yaml_tag = u'!CommandTester'

    def __init__(self, exec=None, success=None, failure=None):
        self.exec = exec
        self.success = success
        self.failure = failure

    def __call__(self):
        if self.exec():
            self.success()
        else:
            self.failure()

@yaml_object(yaml)
class Commands:
    """a list of commands"""
    yaml_tag = u'!Commands'

    def __init__(self, commands):
        self._commands = commands  # list of commands to execute

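    @classmethod
    def from_yaml(cls, constructor, node):
        # construct_yaml_seq is a generator: it yields the list first and
        # fills it in afterwards, so exhaust it before using the result
        for m in yaml.constructor.construct_yaml_seq(node):
            pass
        return cls(m)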

    @classmethod
    def to_yaml(cls, representer, node):
        return representer.represent_sequence(cls.yaml_tag, node._commands)

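    def __call__(self, verbose=0, stop_on_error=False):
        # run each command in turn; res ends up as the output of the last
        # command, or False if that command failed
        res = True
        for cmd in self._commands:
            try:
                res = subprocess.check_output(cmd)
            except Exception:
                res = False
                if stop_on_error:
                    break
        return res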

@yaml_object(yaml)
class Command(Commands):
    """a single command"""
    yaml_tag = u'!Exec'

    def __init__(self, command):
        Commands.__init__(self, [command])

    @classmethod
    def from_yaml(cls, constructor, node):
        return cls(node.value)

    @classmethod
    def to_yaml(cls, representer, node):
        return representer.represent_scalar(cls.yaml_tag, node._commands[0])


@yaml_object(yaml)
class Notifier:
    yaml_tag = u'!Notify'

with open("job.yaml") as fp:
    job = yaml.load(fp)

yaml.dump(job, sys.stdout)