I have a number of large csv.gz files in Google Cloud Storage that we received from an external source. We need to bring them into BigQuery so we can start querying them, but BigQuery cannot directly ingest gzipped CSV files larger than 4 GB. So I decided to convert these files to Parquet and then load the Parquet output into BigQuery.

Take the websites.csv.gz file as an example, which sits at gs://<BUCKET-NAME>/websites/websites.csv.gz.

For this I wrote the following Hive script:
-- External table over the gzipped CSV
CREATE EXTERNAL TABLE websites (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://<BUCKET-NAME>/websites/'
TBLPROPERTIES ('skip.header.line.count'='1');

MSCK REPAIR TABLE websites;

-- External table for the Parquet output
CREATE EXTERNAL TABLE par_websites (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
STORED AS PARQUET
LOCATION 'gs://<BUCKET-NAME>/websites/par_websites/';

-- Convert CSV to Parquet
INSERT OVERWRITE TABLE par_websites
SELECT *
FROM websites;
This works well: it creates a new par_websites folder at the specified location gs://<BUCKET-NAME>/websites/par_websites/ with one Parquet file inside it.
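
(For completeness, the final step of loading the Parquet output into BigQuery can then be done with the bq CLI; a minimal sketch, where mydataset.websites is a hypothetical target dataset and table:)

bq load --source_format=PARQUET mydataset.websites 'gs://<BUCKET-NAME>/websites/par_websites/*.parquet'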
But when the websites.csv.gz file is in a subfolder, e.g. gs://<BUCKET-NAME>/data/websites/, and I update the script so that the read and write locations are gs://<BUCKET-NAME>/data/websites/ and gs://<BUCKET-NAME>/data/websites/par_websites, it does not work at all. Hive does not seem to read from gs://<BUCKET-NAME>/data/websites/websites.csv.gz, and instead of creating a par_websites folder inside gs://<BUCKET-NAME>/data/websites, it creates a new folder gs://<BUCKET-NAME>/websites/par_websites with no Parquet file inside.
Why is that, and how can I make Hive read from and write to the subfolders?
It turned out Hive was holding on to my previously created table definitions, so when I updated the locations in the script the tables were still pointing at the old paths rather than picking up the new ones.
Once I changed the table names and ran the script again, everything worked.
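
If you would rather keep the same table names, the stale definitions can also be dropped explicitly before re-running the script. A minimal sketch, assuming the subfolder layout from the question (dropping an EXTERNAL table does not delete the underlying files in GCS):

-- Remove the cached table definitions so the new LOCATIONs take effect
DROP TABLE IF EXISTS websites;
DROP TABLE IF EXISTS par_websites;

-- Recreate the source table over the subfolder holding the gzipped CSV
CREATE EXTERNAL TABLE websites (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://<BUCKET-NAME>/data/websites/'
TBLPROPERTIES ('skip.header.line.count'='1');

-- Recreate the Parquet target table inside the same subfolder
CREATE EXTERNAL TABLE par_websites (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
STORED AS PARQUET
LOCATION 'gs://<BUCKET-NAME>/data/websites/par_websites/';

INSERT OVERWRITE TABLE par_websites
SELECT *
FROM websites;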