Tags: machine-learning, mlflow, dvc, wandb, clearml

Experiment tracking for multiple independent ML models using WandB with a single main evaluation


Can you recommend, from your experience, a convenient experiment tracking and versioning tool for a setup with multiple independent models (one input -> multiple models -> one output), so that I get a single main evaluation and can conveniently compare the sub-evaluations? See the project example in the diagram.

I have tried W&B, MLflow, DVC, Neptune.ai, DagsHub, and TensorBoard with a single model, but I'm not sure which of them is convenient for multiple independent models. Searching Google for phrases like "ML experiment tracking and management for multiple models" did not turn up anything either.


Solution

  • Disclaimer: I'm a co-founder at Iterative, and we are the authors of DVC. My response doesn't come from experience with all the tools mentioned above. I took this as an opportunity to try to build a template for this use case in the DVC ecosystem and share it in case it's useful to anyone.

    Here is the GitHub repo I've built (note: it's a template, not a real ML project; the scripts are artificially simplified to show the essence of multi-model evaluation):

    DVC Model Ensemble

    I've put together an extensive README with a few videos of the CLI, VS Code, and Studio tools.

    The core part of the repo is this DVC pipeline, which "trains" multiple models, collects their metrics, and then runs an evaluation stage to "reduce" those metrics into the final one.

    stages:
      train:
        foreach:
          - model-1
          - model-2
        do:
          cmd: python train.py
          wdir: ${item}
          params:
            - params.yaml:
          deps:
          - train.py
          - data
          outs:
          - model.pkl:
              cache: false
          metrics:
          - ../dvclive/${item}/metrics.json:
              cache: false
          plots:
          - ../dvclive/${item}/plots/metrics/acc.tsv:
              cache: false
              x: step
              y: acc
      evaluate:
        cmd: python evaluate.py
        deps: 
        - dvclive
        metrics:
        - evaluation/metrics.json:
            cache: false
    

    It describes how the parts of the project are built and connected, and it makes the project "runnable" and reproducible. It scales to any number of models: add another entry to the list under the foreach clause of the train stage.
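
    To make the "reduce" step concrete, here is a minimal sketch of what an evaluate.py for this pipeline could look like. It is not the script from the template repo. It assumes that each model writes a flat dvclive/<model>/metrics.json containing an "acc" value, and that the final metric is the plain mean of those values; the function names and the "best_model" field are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean


def collect_metrics(root: Path) -> dict[str, dict]:
    """Read <root>/<model>/metrics.json for every model directory found."""
    collected = {}
    for metrics_file in sorted(root.glob("*/metrics.json")):
        collected[metrics_file.parent.name] = json.loads(metrics_file.read_text())
    return collected


def reduce_metrics(per_model: dict[str, dict], key: str = "acc") -> dict:
    """Combine per-model values into one summary.

    Assumption: the combined score is the unweighted mean of `key`.
    Swap in whatever rule defines the real "main evaluation".
    """
    values = {name: m[key] for name, m in per_model.items() if key in m}
    if not values:
        # Nothing to reduce yet (for example, no train stage has run).
        return {"models": {}, f"mean_{key}": None, "best_model": None}
    return {
        "models": values,  # sub-evaluations, kept for side-by-side comparison
        f"mean_{key}": mean(values.values()),
        "best_model": max(values, key=values.get),
    }


def main(root: str = "dvclive", out: str = "evaluation/metrics.json") -> dict:
    """Collect, reduce, and write the summary where the evaluate stage expects it."""
    summary = reduce_metrics(collect_metrics(Path(root)))
    out_path = Path(out)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(summary, indent=2))
    return summary


if __name__ == "__main__":
    print(json.dumps(main(), indent=2))
```

    Keeping the per-model values inside the summary means the single evaluation/metrics.json file carries both the main metric and the sub-evaluations, so one metrics file is enough to compare them between runs.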

    Please let me know if that fits your scenario or if you have more requirements. I'm happy to learn more and iterate on it :)