Include Quarto rendering in kedro pipeline and pass it inputs/outputs


I am using Kedro to run a comparative analysis.

I am using the quarto Python package, which wraps the Quarto CLI through its render function. This function takes a .qmd file as input and generates an HTML report from it, executing the Python chunks along the way.

My Quarto report has some chunks that evaluate output_var1 and output_var2, for example:

plot_function(output_var1)
plot_function(output_var2)

where output_var1 and output_var2 are, for example, pandas DataFrames (they could be any type of data).

At the end of the pipeline, I would like to render my report with quarto using the outputs of my pipeline, without saving them to the data catalog.

from kedro.pipeline import Pipeline, node, pipeline
from quarto import render

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([node(func=function1,
                          inputs='my_input',
                          outputs="output_var1"),
                     node(func=function2,
                          inputs='my_input',
                          outputs="output_var2"),
                     node(func=render,
                          inputs='params:my_quarto_report',  # path to a Quarto report (*.qmd)
                          outputs=None)])

In this example, my_input is described in the data catalog, but neither output_var1 nor output_var2 is.

The above example fails because I don't know how to pass output_var1 and output_var2 to quarto. How can this be done? Does quarto have a way to pass complex variables such as DataFrames? I understand how to pass simple text or numerical variables, but I don't see how to pass something that does not fit on the command line.


Solution

  • After some tinkering I reached a decent solution: complex variables cannot be passed directly to quarto, but the node generating the report can be made dependent on other Kedro catalog items by passing them as kwargs to the node that calls the quarto render function. The report then loads the data from the catalog itself. Here is an example of a generate_reports Kedro pipeline that generates a report depending on output_var, which was generated in a different pipeline/node and saved as the catalog entry output_var_catalog_entry.

    conf/base/catalog.yml:

    output_var_catalog_entry:
      type: pickle.PickleDataSet
      filepath: data/07_model_output/output_var.pkl
    

    conf/base/parameters.yml:

    report_filename: notebooks/report.qmd
    

    notebooks/report.qmd:

    ---
    jupyter: python3
    title: My title
    ---
    
    Some explanations
    
    ```{python}
    from kedro.config import ConfigLoader
    from kedro.io import DataCatalog
    
    # Rebuild the data catalog inside the report and load the saved output
    conf_loader = ConfigLoader('conf')
    conf_catalog = conf_loader.get("catalog.yml")
    catalog = DataCatalog.from_config(conf_catalog)
    output_var = catalog.load("output_var_catalog_entry")
    some_plot(output_var)
    ``` 
    

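    src/project_name/pipelines/generate_reports/nodes.py:

    from quarto import render
    
    def generate_report(report: str, **kwargs):
        # kwargs are not forwarded to quarto; they are declared as inputs only
        # so that Kedro runs this node after those datasets have been produced
        print("This report depends on:")
        for kw in kwargs:
            print(kw)
        render(report)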
    
    

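    src/project_name/pipelines/generate_reports/pipeline.py:

    from kedro.pipeline import Pipeline, node, pipeline
    from .nodes import generate_report
    
    def create_pipeline(**kwargs) -> Pipeline:
        # The dataset name must match the catalog entry that the report loads
        return pipeline([node(func=generate_report,
                              inputs={"report": 'params:report_filename',
                                      "output_var": "output_var_catalog_entry"},
                              outputs=None,
                              name='generate_report')])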
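
To make the dependency-only role of the kwargs more visible, here is a minimal sketch that can be run without Kedro or quarto installed. The factory make_generate_report and the stand-in renderer are hypothetical names introduced for illustration; they are not part of either library. The sketch only shows that the extra inputs are received and listed, while the renderer is called with the report path alone.

```python
from typing import Any, Callable, List


def make_generate_report(render_fn: Callable[[str], Any]) -> Callable[..., None]:
    """Wrap a render callable (assumed to accept a report path) as a node function."""

    def generate_report(report: str, **dependencies: Any) -> None:
        # The loaded datasets are intentionally unused: declaring them as inputs
        # is what makes the runner schedule this node after they exist.
        for name in sorted(dependencies):
            print(f"report depends on: {name}")
        # Only the report path reaches the renderer.
        render_fn(report)

    return generate_report


# Stand-in renderer (hypothetical): records the path instead of calling quarto.
calls: List[str] = []
generate_report = make_generate_report(calls.append)
result = generate_report("notebooks/report.qmd", output_var={"rows": 3})
```

In the project itself, the node function would presumably be built from the render function of the quarto package instead of the stand-in; that wiring is not exercised here.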