Adding data using dagshub.upload.Repo(USER_NAME, REPO_NAME)


I want to add a raw dataset file to my DagsHub repo (my first repo, which I'm using alongside an MLflow tutorial).

This is the code that is giving me trouble:

repo = dagshub.upload.Repo(USER_NAME,REPO_NAME)

repo.upload(local_path='data/winequality.txt',
            remote_path='data/raw/winequality.txt',
            commit_message='Added Raw Data',
            versioning='dvc')

and this is the error I get:

Uploading files (1) to "USER_NAME/REPO_NAME"...
---------------------------------------------------------------------------
DagsHubAPIError                           Traceback (most recent call last)
<ipython-input-49-e8d1e8493248> in <cell line: 4>()
      2 repo = dagshub.upload.Repo(USER_NAME,REPO_NAME)
      3 
----> 4 repo.upload(local_path='data/winequality.txt',
      5             remote_path='data/raw/winequality.txt',
      6             commit_message='Added Raw Data',

2 frames
/usr/local/lib/python3.10/dist-packages/dagshub/upload/wrapper.py in upload(self, local_path, commit_message, remote_path, **kwargs)
    286         else:
    287             file_to_upload = DataSet.get_file(str(local_path), remote_path)
--> 288             self.upload_files([file_to_upload], commit_message=commit_message, **kwargs)
    289 
    290     def upload_files(

/usr/local/lib/python3.10/dist-packages/dagshub/upload/wrapper.py in upload_files(self, files, directory_path, commit_message, versioning, new_branch, last_commit, force)
    375             timeout=None,
    376         )
--> 377         self._log_upload_details(data, res, files)
    378 
    379         # The ETag header contains the hash of the uploaded commit,

/usr/local/lib/python3.10/dist-packages/dagshub/upload/wrapper.py in _log_upload_details(self, data, res, files)
    413             log_message(f"Got unknown successful status code {res.status_code}")
    414         else:
--> 415             raise determine_upload_api_error(res)
    416 
    417     def _poll_mirror_up_to_date(self):

DagsHubAPIError: file missing from storage:
Required resource is missing from the storage, is '' stored in your repository DagsHub storage?

The Repo file structure looks like this:
Local disk:
root/
  |...data/
    |... winequality.txt

Remote:
root/
  |...data/
     |...raw/

Note that 'raw' is version-controlled by DVC, but the DagsHub documentation (Upload Data) shows that this is the way to do it.

Not sure what I am missing.


Solution

  • The issue seems to be caused by missing DVC-tracked files, which prevent new files from being added to the directory. To solve it, run the following commands:

    If DVC is not already installed, install it with S3 support first:

    pip install dvc "dvc[s3]"

    git clone https://dagshub.com/<user_name>/<repo_name>.git
    cd <repo_name>
    
    dvc remote add origin --local s3://dvc
    dvc remote modify origin --local endpointurl https://dagshub.com/<user_name>/<repo_name>.s3
    
    dvc remote modify origin --local access_key_id <your_token>
    dvc remote modify origin --local secret_access_key <your_token>
    

    Once the remote is configured, run the following:

    mkdir -p data/raw
    dvc commit data/raw.dvc
    dvc push -r origin
    

    Then run your code again. It should now work!

    That being said, this is probably something we can improve on our end too, so I'll share it with the engineering team!

    Thanks for the question :)
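
For reference, once the DVC remote has been configured and pushed as described above, the upload from the question can be retried. The sketch below is one possible way to wrap that call, not an official DagsHub recipe. The helper name upload_raw_file is hypothetical; it assumes `repo` is the object returned by dagshub.upload.Repo(USER_NAME, REPO_NAME) and passes the same keyword arguments as the question. It also checks that the local file exists first, so a local path mistake is not confused with the server-side "file missing from storage" error.

```python
from pathlib import Path


def upload_raw_file(repo, local_path, remote_path, commit_message="Added Raw Data"):
    """Upload one local file through a dagshub Repo-like object, tracked with DVC.

    Hypothetical helper: `repo` is assumed to be the object returned by
    dagshub.upload.Repo(USER_NAME, REPO_NAME), as in the question.
    """
    local = Path(local_path)

    # Fail early if the local file is missing, so a local path mistake
    # is not confused with the server-side storage error.
    if not local.is_file():
        raise FileNotFoundError(f"Local file not found: {local}")

    # Same keyword arguments as the call in the question.
    repo.upload(
        local_path=str(local),
        remote_path=remote_path,
        commit_message=commit_message,
        versioning="dvc",
    )
    return remote_path


# Usage, assuming dagshub is installed and USER_NAME / REPO_NAME are defined:
#
#   import dagshub.upload
#   repo = dagshub.upload.Repo(USER_NAME, REPO_NAME)
#   upload_raw_file(repo, "data/winequality.txt", "data/raw/winequality.txt")
```

If the upload still fails with the same DagsHubAPIError, that suggests the DVC-tracked data/raw directory has not yet been pushed to the repository's DagsHub storage.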