I am trying to load all the csv.gz files from this url to google bigquery. What is the best way to do this?
I tried using pyspark to read the csv.gz files (as I need to perform some data cleaning on these files) but I realized that pyspark doesn't support directly reading files from url. Would it make sense to load the cleaned versions of the csv.gz files into BigQuery or should I dump the raw,original csv.gz files in BigQuery and perform my cleaning process in BigQuery itself?
I was reading the "Google BigQuery: The Definitive Guide" book and it suggests to load the data on Google Cloud Storage. Do I have to load each csv.gz file into Google Cloud Storage or is there an easier way to do this?
Thanks for your help!
As @Samuel mentioned, you can use the curl
command to download the files from the URL and then copy the files to GCS bucket.
If you have heavy transformations to be done on the data I would recommend using Cloud Dataflow otherwise you can go for Cloud Dataprep workflow and finally export your clean data to BigQuery table.
Choosing BigQuery for transformations totally depends upon your use-case, data size and budget ie, if you have high volume then direct transformations could be costly.