python pytorch google-colaboratory spacy torchtext

"No such file" when loading CSV data stored in Google Drive into torchtext using torchtext.data.TabularDataset


I have stored a CSV file in Google Drive and am trying to load it with torchtext's data.TabularDataset. It fails with the error "FileNotFoundError: [Errno 2] No such file or directory: 'https://.....'"

Is it impossible to load a CSV file from Google Drive directly into a torchtext TabularDataset?

Here is the code. I have also made a public Colab notebook, and the data is publicly available.

import torch
from torchtext import data, datasets

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

TEXT = data.Field(tokenize = 'spacy', batch_first = True, lower=False)  
LABEL = data.LabelField(sequential=False, dtype = torch.float) 

train = data.TabularDataset(path = 'https://drive.google.com/open?id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta', 
                            format = 'csv', 
                            fields = [('Insult', LABEL), (None, None), ('Comment', TEXT)], 
                            skip_header=False)

Solution

  • TabularDataset opens its path argument as a local file, so it cannot read from a URL directly. Assuming you can afford to download the CSV file, I suggest using a utility built into torchtext: download_from_url.

    import os
    import torch
    from torchtext import data, datasets
    from torchtext.utils import download_from_url
    
    # download the file
    CSV_FILENAME = 'data.csv'
    CSV_GDRIVE_URL = 'https://drive.google.com/uc?export=download&id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta'
    download_from_url(CSV_GDRIVE_URL, CSV_FILENAME)
    
    TEXT = data.Field(tokenize = 'spacy', batch_first = True, lower=False)
    LABEL = data.LabelField(sequential=False, dtype = torch.float) 
    
    # on Colab the working directory is /content, so the file downloaded above is saved there
    train = data.TabularDataset(path=os.path.join('/content', CSV_FILENAME),
                                format='csv',
                                fields = [('Insult', LABEL), (None, None), ('Comment', TEXT)],
                                skip_header=False )
    

    Note that the Google Drive link must not be the share link containing open?id=; replace that part with uc?export=download&id= to get a direct download link.
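
The link rewrite described above can be sketched as a small helper. This is only an illustration, not part of torchtext: the function name to_direct_download_url is hypothetical, and it assumes the share link carries the file ID in an id query parameter, as the link in the question does.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def to_direct_download_url(share_url):
    """Turn a Drive 'open?id=...' share link into a 'uc?export=download&id=...' link.

    Hypothetical helper for illustration; assumes the file ID is in the 'id' query parameter.
    """
    query = parse_qs(urlparse(share_url).query)
    file_ids = query.get("id")
    if not file_ids:
        raise ValueError("no 'id' query parameter found in: " + share_url)
    # Rebuild the link in the direct-download form used in the answer above
    params = urlencode({"export": "download", "id": file_ids[0]})
    return "https://drive.google.com/uc?" + params


share_url = "https://drive.google.com/open?id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta"
print(to_direct_download_url(share_url))
```

The resulting string can be passed as the first argument to download_from_url in place of the hard-coded CSV_GDRIVE_URL.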