DVC - make scheduled CSV dumps


Suppose we have a database (any database that supports CSV dumps) collecting raw data in real time for later use in ML. On the other side, we have DVC, which can work with CSV files.

I want to set up a scheduled run of a stored SELECT against that database with datetime parameters (with support for manual runs as well), to produce new CSV files and send them to DVC.

In the DVC documentation and examples I found, the CSV file already exists.

Can I handle this interaction with the database in DVC itself, or have I misunderstood something, and a separate tool is needed for the CSV dump?


Solution

  • There are 3 steps in this process:

    1. Create a CSV dump. Many databases ship with tools for this, but DVC does not support it natively.
    2. Version the CSV dump and move it to some storage. DVC does this job.
    3. Schedule a periodic dump. You can use cron (easy), Airflow (not easy), or scheduled jobs in GitHub Actions/GitLab CI/CD. Another project from the DVC team can help with the CI/CD option: https://cml.dev.
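The first two steps could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: SQLite stands in for "any database", the table `raw_events`, its columns, and the function names `dump_csv` and `version_with_dvc` are hypothetical, and the DVC part assumes the script runs inside an initialized DVC repository with a default remote configured. It shells out to the real `dvc add` and `dvc push` commands.

```python
import csv
import sqlite3
import subprocess
from datetime import datetime
from pathlib import Path

# Hypothetical stored query; the table and column names are assumptions.
QUERY = (
    "SELECT id, value, created_at FROM raw_events "
    "WHERE created_at >= ? AND created_at < ? ORDER BY created_at"
)


def dump_csv(conn, start, end, out_path):
    """Step 1: run the stored SELECT for [start, end) and write a CSV file.

    Returns the number of data rows written (header excluded).
    """
    # Timestamps are assumed to be stored as 'YYYY-MM-DD HH:MM:SS' text.
    params = (start.isoformat(sep=" "), end.isoformat(sep=" "))
    cur = conn.execute(QUERY, params)
    header = [col[0] for col in cur.description]

    out_path.parent.mkdir(parents=True, exist_ok=True)
    rows = 0
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cur:
            writer.writerow(row)
            rows += 1
    return rows


def version_with_dvc(csv_path):
    """Step 2: track the dump with DVC and upload it to the default remote."""
    subprocess.run(["dvc", "add", str(csv_path)], check=True)
    subprocess.run(["dvc", "push"], check=True)
```

For step 3, a cron entry or CI job could call these two functions with, say, the previous hour or day as the time window, while a manual run would pass explicit datetimes. The `.dvc` file that `dvc add` creates next to the CSV still needs to be committed to Git for the version to be recorded.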