Is there a way to log the descriptive stats of a dataset using MLflow?

Is there a way to log the descriptive stats of a dataset using MLflow? If so, could you please share the details?

Solution

Generally speaking, you can log arbitrary output from your code using the mlflow.log_artifact() function. From the docs:

mlflow.log_artifact(local_path, artifact_path=None) Log a local file or directory as an artifact of the currently active run.

Parameters:
local_path – Path to the file to write.
artifact_path – If provided, the directory in artifact_uri to write to.

As an example, say you have your statistics in a pandas dataframe, stat_df.

import mlflow

# Write the stats dataframe to a CSV file
stat_df.to_csv('dataset_statistics.csv')

# Log the CSV as an artifact of the active MLflow run
mlflow.log_artifact('dataset_statistics.csv')

This will show up under the artifacts section of this MLflow run in the Tracking UI. If you explore the docs further, you'll see that you can also log an entire directory and the objects therein. In general, MLflow gives you a lot of flexibility: anything you write to your file system, you can track with MLflow. Of course, that doesn't mean you should. :)