Tags: python, package, huggingface-transformers, huggingface-datasets

Install the huggingface datasets package without an internet connection in a Python environment


I don't have access to an internet connection from my Python environment. I would like to install this library.

I also noticed this page, which has the files required for the package. I installed the package by copying that file into my Python environment and then running the code below:

pip install 'datasets_package/datasets-1.18.3.tar.gz'
Successfully installed datasets-1.18.3 dill-0.3.4 fsspec-2022.1.0 multiprocess-0.70.12.2 pyarrow-6.0.1 xxhash-2.0.2
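
(As an aside, a common way to gather a package together with all of its dependencies on a machine that does have internet access, and then install everything offline, is pip download plus pip install --no-index; the directory name here is only an example:)

# on a machine with internet access
pip download datasets==1.18.3 -d datasets_package/

# on the offline machine, after copying datasets_package/ over
pip install --no-index --find-links datasets_package/ datasets==1.18.3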

But when I run the code below

import datasets
datasets.load_dataset('imdb', split=['train', 'test'])

it throws this error:

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.18.3/datasets/imdb/imdb.py (error 403)

I can access the file https://raw.githubusercontent.com/huggingface/datasets/1.18.3/datasets/imdb/imdb.py from outside my Python environment.

What files should I copy, and what other code changes should I make, so that datasets.load_dataset('imdb', split=['train', 'test']) works?

Update 1

I followed the suggestions below and copied the following files into my Python environment:

os.listdir('huggingface_imdb_data/')
['dummy_data.zip',
 'dataset_infos.json',
 'imdb.py',
 'README.md',
 'aclImdb_v1.tar.gz']

The last file comes from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz and the other files come from github.com/huggingface/datasets/tree/master/datasets/imdb.

Then I tried

import datasets
#datasets.load_dataset('imdb', split =['train', 'test'])
datasets.load_dataset('huggingface_imdb_data/aclImdb_v1.tar.gz')

but I get the error below :(

HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/datasets/huggingface_imdb_data/aclImdb_v1.tar.gz?full=true

I also tried

datasets.load_from_disk('huggingface_imdb_data/aclImdb_v1.tar.gz')

but I get the error

FileNotFoundError: Directory huggingface_imdb_data/aclImdb_v1.tar.gz is neither a dataset directory nor a dataset dict directory.

Solution

  • Unfortunately, Method 1 does not work, because this is not yet supported: https://github.com/huggingface/datasets/issues/761 (a manual fallback is sketched after the Update 1 code below)

    Method 1: You should use the data_files parameter of the datasets.load_dataset function and provide the path to your local data file. See the documentation: https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset

    datasets.load_dataset
    Parameters
    ...
    data_dir (str, optional) – Defining the data_dir of the dataset configuration.
    data_files (str or Sequence or Mapping, optional) – Path(s) to source data file(s).
    ...
    

    Update 1: You should use something like this:

    datasets.load_dataset('imdb', split=['train', 'test'], data_files='huggingface_imdb_data/aclImdb_v1.tar.gz')
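
    Since passing the raw aclImdb_v1.tar.gz through data_files is not supported (see the issue linked above), a rough fallback is to extract the archive and build the dataset manually with datasets.Dataset.from_dict. This is only a sketch: the paths and the 0/1 label encoding are assumptions based on the usual layout of the extracted archive.

    import os
    import tarfile
    import datasets

    # extract the archive that was copied to the offline machine
    with tarfile.open('huggingface_imdb_data/aclImdb_v1.tar.gz') as tar:
        tar.extractall('huggingface_imdb_data/')

    def read_split(split_dir):
        # one .txt file per review, grouped into neg/ and pos/ folders
        texts, labels = [], []
        for label_name, label_id in [('neg', 0), ('pos', 1)]:
            folder = os.path.join(split_dir, label_name)
            for file_name in sorted(os.listdir(folder)):
                with open(os.path.join(folder, file_name), encoding='utf-8') as f:
                    texts.append(f.read())
                labels.append(label_id)
        return datasets.Dataset.from_dict({'text': texts, 'label': labels})

    data = datasets.DatasetDict({
        'train': read_split('huggingface_imdb_data/aclImdb/train'),
        'test': read_split('huggingface_imdb_data/aclImdb/test'),
    })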
    

    Method 2:

    Or check out this discussion: https://github.com/huggingface/datasets/issues/824#issuecomment-758358089

    >here is my way to load a dataset offline, but it requires an online machine
    
    (online machine)
    
        import datasets
        data = datasets.load_dataset(...)
        data.save_to_disk('./saved_imdb')
    
    >copy the './saved_imdb' dir to the offline machine
    
    (offline machine)
    
        import datasets
        data = datasets.load_from_disk('./saved_imdb')
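
    After load_from_disk, data is a DatasetDict, so the splits can be accessed as usual; setting the environment variable HF_DATASETS_OFFLINE=1 on the offline machine should also keep the library from attempting any network calls. A minimal usage sketch (the split names depend on what was saved):

        print(data)              # shows the available splits
        train_ds = data['train']
        test_ds = data['test']
        print(train_ds[0])       # first training example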