I am trying to get specific column data out of specific sheets in one big google sheet. For example, I have a list of sheets
Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']
and from these sheets, I want to retrieve specific columns like
Column_headers = ['A', 'B']
What I am doing right now is getting the data from
import gspread
from gspread_dataframe import set_with_dataframe
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', None)
sa = gspread.service_account(filename='file.json')
book = sa.open("book")
Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']
Column_headers = ['A', 'B']
for i in Sheets:
2022_sheet = book.worksheet(i)
records = 2022_sheet.get_all_records()
data_2022 = zip(*(e for e in zip(*record) if e[0] in Column_headers))
getdata_2022 = pd.DataFrame(data_2022, columns = Column_headers)
print(getdata_2022)
I am getting the following error
GSpreadException: the given 'expected_headers' are not uniques
that is because the headers are not unique (obv) that's why I am retrieving specific columns, also I can't understand the bit where I loop through the 'Sheets' to get the data only from specific sheets. Eventually the end result should be two columns 'A' and 'B' with all the data from the specific 3 sheets.
IIUC try doing the following:
import gspread
import gspread_dataframe as gd
import pandas as pd
# connect to the service account
sa = gspread.service_account(filename='file.json')
# open the Google Sheet
book = sa.open('book')
# The list of sheet names and columns that you want
Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']
Column_headers = ['A', 'B']
# list comprehension using gd.get_as_dataframe
# to create a list of DataFrames for each sheet
# then you concat all frames together into one DataFrame
df = pd.concat([gd.get_as_dataframe(book.worksheet(sheet))[Column_headers]
for sheet in Sheets])
You may need to filter out NaNs
if you have empty rows in your Google Sheet. Something like:
df = df[~df.isna().all(axis=1)]
replace df = pd.concat([gd.get_as_dataframe(book.worksheet(sheet))[Column_headers] for sheet in Sheets])
with the following
dfs = []
for sheet in Sheets:
df = gd.get_as_dataframe(book.worksheet(sheet))
try:
df = df[Column_headers]
except KeyError:
df = df[['A']]
df['B'] = np.nan
# the commented out code below will add the sheet name to each frame
# df['Sheet_Name'] = sheet
# you can also remove all nan rows here if you need
# df = df[~df.isna().all(axis=1)]
dfs.append(df)
final_df = pd.concat(dfs)