Search code examples
pythongoogle-colaboratory

How to read data in Google Colab from my Google drive?


I have some data on gDrive, for example at /projects/my_project/my_data*.

Also I have a simple notebook in gColab.

So, I would like to do something like:

for file in glob.glob("/projects/my_project/my_data*"):
    do_something(file)

Unfortunately, all examples (like this - https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb, for example) suggests to only mainly load all necessary data to notebook.

But, if I have a lot of pieces of data, it can be quite complicated. How to solve this issue?


Solution

  • Good news, PyDrive has first class support on CoLab! PyDrive is a wrapper for the Google Drive python client. Here is an example on how you would download ALL files from a folder, similar to using glob + *:

    !pip install -U -q PyDrive
    import os
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive
    from google.colab import auth
    from oauth2client.client import GoogleCredentials
    
    # 1. Authenticate and create the PyDrive client.
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    
    # choose a local (colab) directory to store the data.
    local_download_path = os.path.expanduser('~/data')
    try:
      os.makedirs(local_download_path)
    except: pass
    
    # 2. Auto-iterate using the query syntax
    #    https://developers.google.com/drive/v2/web/search-parameters
    file_list = drive.ListFile(
        {'q': "'1SooKSw8M4ACbznKjnNrYvJ5wxuqJ-YCk' in parents"}).GetList()
    
    for f in file_list:
      # 3. Create & download by id.
      print('title: %s, id: %s' % (f['title'], f['id']))
      fname = os.path.join(local_download_path, f['title'])
      print('downloading to {}'.format(fname))
      f_ = drive.CreateFile({'id': f['id']})
      f_.GetContentFile(fname)
    
    
    with open(fname, 'r') as f:
      print(f.read())
    

    Notice that the arguments to drive.ListFile is a dictionary that coincides with the parameters used by Google Drive HTTP API (you can customize the q parameter to be tuned to your use-case).

    Know that in all cases, files/folders are encoded by id's (peep the 1SooKSw8M4ACbznKjnNrYvJ5wxuqJ-YCk) on Google Drive. This requires that you search Google Drive for the specific id corresponding to the folder you want to root your search in.

    For example, navigate to the folder "/projects/my_project/my_data" that is located in your Google Drive.

    Google Drive

    See that it contains some files, in which we want to download to CoLab. To get the id of the folder in order to use it by PyDrive, look at the url and extract the id parameter. In this case, the url corresponding to the folder was:

    https://drive.google.com/drive/folders/1SooKSw8M4ACbznKjnNrYvJ5wxuqJ-YCk

    Where the id is the last piece of the url: 1SooKSw8M4ACbznKjnNrYvJ5wxuqJ-YCk.