Search code examples
pythonpython-3.6luigi

Passing Python objects between Tasks in Luigi?


I was coding my first project in Python 3.6 using Spotify's Luigi to arrange some Natural Language Processing Tasks in a pipeline.

I noticed that the output() function of a Task class always returns some kind of Target object, which is just some file somewhere, be it local or remote. Because my Tasks produce more complex data structures like parse trees, it's pretty awkward for me to write them into files as strings and read them again after.

Therefore I would like to ask if there is any possibility to pass Python objects between the tasks within a pipeline?


Solution

  • Short answer: No.

    Luigi parameters are limited to date/datetime objects, string, int and float. See docs for reference.

    That means that you need to serialize your complex data structure as a string (using json, msgpack, whatever serializer you like, and even compress it) and pass it as a string parameter.

    Of course, you may write a custom Parameter subclass, but you'll need to implement the serialize and parse methods basically.

    But take into account: if you use parameters instead of saving your calculated data to a target, you will be loosing one key advantage of using Luigi: if the parent task in the tree fails more than the count of retries you specify, then you´ll need to run the task that calculates that complex data structure again. If your tasks calculates complex data or takes a considerable amount of time or consumes a lot of resources, then you should save the output as a target in order to not having to do all that expensive computation again.

    And looking beyond: another task may need that data too, so why not save it?

    Also, notice that targets are not only files: you may save your data to a database table, Redis, Hadoop, an Elastic Search index, and many more: http://luigi.readthedocs.io/en/stable/api/luigi.contrib.html#submodules