I am working on an automated script that downloads files from an FTP server and loads them into BigQuery. The problem is that BigQuery only accepts .csv files, so I am looking for a way to also handle .xlsx and .xls files, given that I plan to run this batch job in the Cloud. I mention the latter because one way to convert .xlsx files to .csv is to use something like:
import pandas as pd
data_xls = pd.read_excel('file_on_ftp.xlsx')
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)
However, this creates local files somewhere in temporary storage. With Cloud Functions I would then have to monitor whether those files actually get deleted afterwards, which is not reliable if a function crashes.
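One workaround I have been considering is to keep everything in memory so nothing is ever written to local disk. A rough sketch of that idea (it assumes the workbook bytes have already been fetched from the FTP server, e.g. with ftplib, and table_id is a placeholder for the destination table):

import io
import pandas as pd
from google.cloud import bigquery

def xlsx_bytes_to_bigquery(xlsx_bytes, table_id):
    # Parse the workbook directly from memory; nothing touches local disk.
    df = pd.read_excel(io.BytesIO(xlsx_bytes))

    # Render the CSV into an in-memory buffer instead of a temp file.
    csv_buffer = io.BytesIO(df.to_csv(index=False).encode('utf-8'))

    # Load the buffer into BigQuery as CSV.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_file(csv_buffer, table_id, job_config=job_config).result()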
Are there, therefore, better ways to handle loading .xlsx files into BigQuery? Or is this the way to go?
You might be interested in this guide, which was just recently published: "Streaming data from Cloud Storage into BigQuery using Cloud Functions".
The general architecture would be:

1. Upload the file from the FTP server to Cloud Storage.
2. A Cloud Function is triggered by the upload.
3. The Cloud Function reads the file, converts it if necessary, and streams its contents into BigQuery.
I'm not sure if #1 suits your need to ensure that the file is not left behind somewhere, but I think it's probably the best possible solution if you are operating under the assumption that the file needs to be uploaded to GCP somewhere (the alternative would be streaming to BigQuery directly from your local machine, or an instance that is under your control).
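For steps 2 and 3, the Cloud Function itself can stay stateless and work entirely in memory. Here is a minimal sketch of a background function triggered by a Cloud Storage upload; the function name and the table ID my_dataset.my_table are placeholders, and it uses a load job from a DataFrame rather than the row-by-row streaming inserts shown in the guide:

import io
import pandas as pd
from google.cloud import bigquery, storage

def gcs_xlsx_to_bigquery(event, context):
    # Triggered by a google.storage.object.finalize event on the bucket.
    blob = storage.Client().bucket(event['bucket']).blob(event['name'])

    # Read the uploaded workbook straight into a DataFrame (no temp files).
    df = pd.read_excel(io.BytesIO(blob.download_as_bytes()))

    # Load the rows into BigQuery; .result() waits for the load job to finish.
    bigquery.Client().load_table_from_dataframe(df, 'my_dataset.my_table').result()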