Search code examples
pythongoogle-sheetsgoogle-sheets-apigspread

Python code with gspread takes too long (sometimes)


I´ve recently developed a code to recover tables across 5 very loaded google sheets using it´s url and the table´s title (which´s range is protected for this not to fail). The tables are not normalized, and there can be from 1 to 5 per sheet. Sometimes it starts on A3, sometimes on B5, etc, that´s why we use the title to retrieve them. I think it´s best to just post the bit of code that is taking too long, and comment you through it:

#Finds the sheet where the table is located (info stored in a BQ table)
hoja = query_select_keys_tablas(columnas='NOMBRE_HOJA',where='NOMBRE_TABLA',filtro=hoja_key).iloc[0,0]

gc = gspread.authorize(credentials=credentials)

#This is one of the lines that takes a bit of time, but it is understandable
ws = gc.open_by_key(url_pais).worksheet(hoja)

#LINE 1: One of three lines that takes over 10 seconds to run (depending on url and table).
table_location = ws.find(hoja_key)

table_start_row = table_location.row + 1
table_start_col = table_location.col

#LINE 2: Another line that takes 10 seconds. This is needed to understand where the table ends.
columns_in_sheet = ws.row_values(table_start_row)

#This step is needed because there are tables next to each other, but always separated by one empty column, so this way we can know where one starts and another ends.
columns_in_sheet = columns_in_sheet[table_start_col-1:]
try:
  table_end_col = columns_in_sheet.index('') + table_start_col - 1
except ValueError:
  table_end_col = len(columns_in_sheet) + table_start_col - 1

table_range = f'{get_column_letter(table_start_col)}{table_start_row}:{get_column_letter(table_end_col)}'

#LINE 3: This takes the longest I think.
table = ws.get(table_range,value_render_option = 'UNFORMATTED_VALUE', date_time_render_option = 'FORMATTED_STRING')

df = pd.DataFrame(table)
df.columns = df.iloc[0]
df = df[1:]
df = df[df[df.columns[0]] != '']

So first off, am I doing something wrong? Maybe missing a parameter that would make all the gspread methods way faster. It´s super odd because sometimes the same code that took 1 minute to run will take 4 seconds, and it seems super random. The only pattern I recognize is that the first time running the code with a pair of variables is never fast, but the next ones might be faster or not.

If there is nothing I can do about it, is there any way I can make this code more time efficient by removing or changing one of the three marked lines? For example, I had a 4th line that would retrieve the last row, the same way I get the last column. However, I found that by just specifying the last column and then filtering out all the empty cells at the end I saved more time.

While 1 or 2 minutes seem like little time, this is just one part of a bigger script, where the analyst who will run the whole thing might have to run it 20 times in one day.

Any ideas will be appreciated!


Solution

  • Modification points:

    • When I saw your showing script for retrieving table, it seems that 3 requests are done to Sheets API as follows. I'm worried that this might bring an increase in the cost. I think that in this case, the number of requests can be reduced.

      1. table_location = ws.find(hoja_key)
      2. columns_in_sheet = ws.row_values(table_start_row)
      3. table = ws.get(table_range,value_render_option = 'UNFORMATTED_VALUE', date_time_render_option = 'FORMATTED_STRING')
    • Although I'm not sure whether this is related to the increase in the cost, as one more modification, I declared both the start of row and column and the end of row and column.

    When these points are reflected in your script, how about the following modification?

    From:

    #LINE 1: One of three lines that takes over 10 seconds to run (depending on url and table).
    table_location = ws.find(hoja_key)
    
    table_start_row = table_location.row + 1
    table_start_col = table_location.col
    
    #LINE 2: Another line that takes 10 seconds. This is needed to understand where the table ends.
    columns_in_sheet = ws.row_values(table_start_row)
    
    #This step is needed because there are tables next to each other, but always separated by one empty column, so this way we can know where one starts and another ends.
    columns_in_sheet = columns_in_sheet[table_start_col-1:]
    try:
      table_end_col = columns_in_sheet.index('') + table_start_col - 1
    except ValueError:
      table_end_col = len(columns_in_sheet) + table_start_col - 1
    
    table_range = f'{get_column_letter(table_start_col)}{table_start_row}:{get_column_letter(table_end_col)}'
    
    #LINE 3: This takes the longest I think.
    table = ws.get(table_range,value_render_option = 'UNFORMATTED_VALUE', date_time_render_option = 'FORMATTED_STRING')
    

    To:

    values = ws.get_all_values()
    searched = [{"row": i, "col": r.index(hoja_key)} for i, r in enumerate(values) if hoja_key in r]
    if searched != []:
        table_start_row = searched[0]["row"] + 2
        table_end_row = len(values)
        table_start_col = searched[0]["col"] + 1
        table_end_col = max([len(r) for r in values])
        table_range = f'{get_column_letter(table_start_col)}{table_start_row}:{get_column_letter(table_end_col)}{table_end_row}'
        table = ws.get(table_range, value_render_option='UNFORMATTED_VALUE', date_time_render_option='FORMATTED_STRING')
    
    • When this script is run, Sheets API is requested 2 times, and only the data range is retrieved by table_range. I guessed that this might be able to reduce the cost of the script a little.