data-management, dvc

Updating data in a DVC data registry from other projects


I have a couple of projects that use and update the same data sources. I recently learned about DVC's data registries, which sound like a great way of versioning data across these different projects (e.g. scrapers, computational pipelines).

I have put all of the relevant data into data-registry and then I imported the relevant files into the scraper project with:

$ poetry run dvc import https://github.com/username/data-registry raw

where raw is a directory that stores the scraped data. This seemed to work, but when I then tried to build a DVC pipeline stage that writes to a file inside that already-tracked directory, I got an error:

$ dvc run -n menu_items -d src/ -o raw/menu_items/restaurant.jsonl scrapy crawl restaurant
ERROR: Paths for outs:
'raw'('raw.dvc')
'raw/menu_items/restaurant.jsonl'('menu_items')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.

Can someone help me understand what is going on here? What is the best way to use data registries to share and update data across projects?

I would ideally like to update the data-registry with new data from the scraper project and then allow other dependent projects to update their data when they are ready to do so.


Solution

  • When you import (or add) something into your project, a .dvc file is created that lists that something (in this case the raw/ dir) as an "output".

    DVC doesn't allow overlapping outputs among .dvc files or dvc.yaml stages, meaning that your "menu_items" stage shouldn't write to raw/ since it's already under the control of raw.dvc.

    Can you make a separate directory for the pipeline outputs? E.g. use processed/menu_items/restaurant.jsonl
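
To make the overlap rule concrete, here is a minimal Python sketch of the kind of check involved. It is only an illustration, not DVC's actual implementation: the function name outs_overlap is hypothetical, and it assumes two outputs "overlap" when one path equals the other or is nested inside it.

```python
from pathlib import PurePosixPath


def outs_overlap(a: str, b: str) -> bool:
    """Hypothetical helper: True if one path equals or contains the other.

    This mirrors the idea behind DVC's "overlapping outs" error; it is
    an assumption for illustration, not DVC's real code.
    """
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


# 'raw' is the output listed in raw.dvc after `dvc import`.
imported_out = "raw"

# The stage output from the question sits inside raw/, so it overlaps.
print(outs_overlap(imported_out, "raw/menu_items/restaurant.jsonl"))

# The suggested processed/ path is outside raw/, so it does not.
print(outs_overlap(imported_out, "processed/menu_items/restaurant.jsonl"))
```

The first call reports an overlap and the second does not, which is why pointing the menu_items stage at processed/ instead of raw/ avoids the error.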