Search code examples
pythongoogle-sheetsgoogle-apigoogle-drive-apigoogle-api-python-client

How to obtain contents of previous versions of google sheets using Python and Google APIs


I have various google sheets for which I need to retrieve the historic versions of. These historic sheets pertain to the status of products and will be appended to a Google Big Query table. As such, it is important that I be able to access the actual contents of these old sheets and not just their metadata.

I have attempted this problem with the Python code below. In this code, I have been able to setup a service with the proper credentials. I am then able to get historic versions in the variable revisions which is a list of dictionaries that look like this

{'id': '15104',
 'mimeType': 'application/vnd.google-apps.spreadsheet',
 'kind': 'drive#revision',
 'modifiedTime': '2023-06-27T12:41:52.305Z'}

This is where I then get stuck. I am not able to download or retrieve the content of this historic version of the file. I typically get an error that complains about only being able to download binary files:

HttpError: <HttpError 403 when requesting https://www.googleapis.com/drive/v3/files/1D1pkeTUDoGZnlHHQh0AiRvFAippyX4OYRWR4XNx3leU/revisions/15098?alt=media returned "Only files with binary content can be downloaded. Use Export with Docs Editors files.". Details: "[{'message': 'Only files with binary content can be downloaded. Use Export with Docs Editors files.', 'domain': 'global', 'reason': 'fileNotDownloadable', 'location': 'alt', 'locationType': 'parameter'}]">

Please help me to understand how to access the contents of the historic files. I am also aware that it might not be possible. If so, please do let me know about such limitations. Thank you for your time.

import os.path
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build



SCOPES = [
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/drive.file',
    'https://www.googleapis.com/auth/spreadsheets',
]

def login():
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'bom_files.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    service = build('drive', 'v3', credentials=creds)
    # Call the Drive v3 API
    return service


def get_sheet_revisions(sheet_id,service):
    revisions = service.revisions().list(fileId=sheet_id).execute().get('revisions')
    revised_file_contents = []  # contents of revised files
    for revision in revisions:
        request = service.revisions().get_media(fileId=sheet_id,
                                                revisionId=revision['id'])
        file_contents = request.execute()
        # Do something with the file like save it.
        # For now, lets append it to a list
        revised_file_contents.append(file_contents)
    return revised_file_contents

if __name__ == '__main__':
    service = login()
    historic_sheets = get_sheet_revisions(sheet_id,service)

EDIT

I have also tried the following. It actually downloads something but it is an unreadable mess. Google sheets cannot even open the xlsx file that it creates. On a positive note, it does give a url request code of 200.

import os.path
import gspread
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
import requests


SCOPES = ['https://www.googleapis.com/auth/drive', 'https://www.googleapis.com/auth/spreadsheets']

def login():
    creds = None
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            creds = Credentials.from_service_account_file('bom_files.json', scopes=SCOPES)
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    return creds


def export_sheet_revision(sheet_id, revision_id, export_format):
    creds = login()
    client = gspread.authorize(creds)
    sheet = client.open_by_key(sheet_id)
    url = f"https://docs.google.com/spreadsheets/export?id={sheet_id}&revision={revision_id}&exportFormat={export_format}"
    return sheet, url


def download_file(url, output_path):
    response = requests.get(url)
    with open(output_path, 'wb') as file:
        file.write(response.content)


if __name__ == '__main__':
    sheet_id = '1D1pkeTUDoGZnlHHQh0AiRvFAippyX4OYRWR4XNx3leU'
    sheet_id = '1wl7kLGLAgCnFB0dn7JYubO-ZwnK5-s-4Rxq-mQtRRC8'  # simpler sheet
    revision_id = '15098'
    export_format = 'xlsx'
    
    sheet, download_url = export_sheet_revision(sheet_id, revision_id, export_format)
    worksheets = sheet.worksheets()

    for worksheet in worksheets:
        worksheet_title = worksheet.title
        worksheet_url = download_url + f'&gid={worksheet.id}'
        output_path = f'output_{worksheet_title}.xlsx'  # Specify the desired output file path for each worksheet
        
        download_file(worksheet_url, output_path)
        print(f"Worksheet '{worksheet_title}' downloaded to: {output_path}")


Solution

  • I thought that the endpoint for exporting Google Spreadsheet in XLSX format with the specific revision ID can be simply created. When this is reflected in a sample script, how about the following sample script?

    Sample script:

    In this case, creds of creds.token is from creds of service = build('drive', 'v3', credentials=creds).

    spreadsheet_id = "###" # Please set your Spreadsheet ID.
    revision_id = "###" # Please set your revision ID.
    type = "xlsx"
    
    url = f"https://docs.google.com/spreadsheets/export?id={spreadsheet_id}&revision={revision_id}&exportFormat={type}"
    res = requests.get(url, headers={"Authorization": "Bearer " + creds.token})
    with open('sample.xlsx', 'wb') as f:
        f.write(res.content)
    

    When this script is run, the Google Spreadsheet is exported in XLSX format with the specific revision ID and it is saved as a file. In the above sample script, Google Spreadsheet can be used. When you want to know other exported mimeTypes, you can use the following sample script.

    spreadsheet_id = "###" # Please set your Spreadsheet ID.
    revision_id = "###" # Please set your revision ID.
    
    service = build("drive", "v3", credentials=creds)
    obj = service.revisions().get(fileId=spreadsheet_id, revisionId=revision_id, fields="*").execute()
    urls = obj.get("exportLinks")
    print(urls)
    

    Note:

    • In the above script, Google Docs files (Documents, Spreadsheets, Slides, and so on) can be used. For example, when the files except for Google Docs files, the following script can be used. This cannot be used for Google Docs files. Please be careful about this. I thought that this might be the reason for your 1st issue.

      file_id = "###" # Please set your file ID.
      revision_id = "###" # Please set your revision ID.
      
      service = build("drive", "v3", credentials=creds)
      request = service.revisions().get_media(fileId=file_id, revisionId=revision_id)
      fh = io.FileIO("sample filename", mode='wb')
      f = MediaIoBaseDownload(fh, request)
      done = False
      while done is False:
          status, done = f.next_chunk()
          print('Download %d%%.' % int(status.progress() * 100))
      

    Reference: