I have a data source in HDFS whose files are NOT compressed, even though their names end in *.csv.gz. Because of that extension, Impala assumes they are gzip-compressed and fails to read them as plain text. Is there a way to read those files in the external table without renaming all the current files? And if there is not, what is the best practice for renaming all the current files in HDFS?
Here is the current creation query for the table:
CREATE EXTERNAL TABLE db.table1(
col1 type,
col2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
WITH SERDEPROPERTIES ('field.delim'='\u0001', 'serialization.format'=';')
STORED AS TEXTFILE
LOCATION 'hdfs://servicename/user/directory'
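To confirm what Impala is actually mapping to the table, the SHOW FILES statement (available in recent Impala versions) lists the backing files, so you can see the .gz names Impala is tripping over:

SHOW FILES IN db.table1;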
Example of the current file names (they are plain text files; the content itself is not compressed, and a quick way to verify this is shown right after the listing):
-rw-rw-r--+ /final/file11_20210601_0000.csv.gz
-rw-rw-r--+ /final/file12_20210601_0015.csv.gz
-rw-rw-r--+ /final/file12_20210601_0045.csv.gz
-rw-rw-r--+ /final/file1_20210601_0015.csv.gz
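One way to double-check that assumption is to inspect the first two bytes of a file: real gzip data always starts with the magic bytes 1f 8b. Using one of the paths from the listing above:

# Print the first two bytes in hex; gzip data would show "1f 8b".
hdfs dfs -cat /final/file11_20210601_0000.csv.gz | head -c 2 | od -An -tx1

If the output is anything other than 1f 8b, the content is not gzip data and only the file name is misleading.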
So far, I have found no table property or SERDEPROPERTIES setting that makes Impala treat files named *.gz as uncompressed text, but I could make a shell script that renames all the files, stripping the ".gz" from their ends:
# List files oldest-first; the awk filter keeps only paths ending in .gz,
# which also skips the "Found N items" header line that -ls prints.
for f in $(hdfs dfs -ls -t -r /user/dir/ | awk '$8 ~ /\.gz$/ {print $8}'); do
  # ${f%.gz} strips only the trailing .gz, so dots elsewhere in the path are safe
  hdfs dfs -mv "$f" "${f%.gz}"
done
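Before running it for real, it is worth doing a dry run that prints the rename commands instead of executing them (same loop, just prefixed with echo):

# Dry run: print each rename command without executing it.
for f in $(hdfs dfs -ls -t -r /user/dir/ | awk '$8 ~ /\.gz$/ {print $8}'); do
  echo hdfs dfs -mv "$f" "${f%.gz}"
done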
But I am still open to solutions that read the .gz-named files directly in the external table.