Search code examples
pandasgoogle-cloud-storage

Using pandas to open Excel files stored in GCS from command line


The following code snippet is from a Google tutorial, it simply prints the names of files on GCP in a given bucket:

from google.cloud import storage


def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"

    storage_client = storage.Client()

    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)

    for blob in blobs:
        print(blob.name)

list_blobs('sn_project_data')

No from the command line I can run:

$ python path/file.py

And in my terminal the files in said bucket are printed out. Great, it works! However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:

df = pd.read_excel(filename)
print(df.iloc[0])

However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...

What command(s) should I run?

Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick to open files and needs to use StringIO in the process. But it doesn't support excel files, so it's not an effective solution.


Solution

  • read_excel() does not support google cloud storage file path as of now but it can read data in bytes.

    pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True, storage_options=None)

    Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object

    What you can do is use the blob object and use download_as_bytes() to convert the object into bytes.

    Download the contents of this blob as a bytes object.

    For this example I just used a random sample xlsx file and read the 1st sheet:

    from google.cloud import storage
    import pandas as pd
    
    bucket_name = "your-bucket-name"
    blob_name = "SampleData.xlsx"
    
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    data_bytes = blob.download_as_bytes()
    
    df = pd.read_excel(data_bytes)
    print(df)
    

    Test done: enter image description here