Say I have a single file "fruitsbought.csv" that contains many records that contain a date field.
Is it possible to "partition" for better performance by creating the "fruits" table based on that text file, while creating a partition in which all the rows in fruitsbought.txt that would match that partition, say if I wanted to do it by year and month, to be created?
Or do I have to as part of a separate process, create a directory for each year and then put the appropriate ".csv" files that are filtered down for that year into the directory structure on HDFS prior to creating the table in impala-shell?
I heard that you can create an empty table, set up partitions, then use "Insert" statements that happen to contain the partition that that record goes into. Though in my current case, I already have a single "fruitsbought.csv" that contains every record I want in it that I like how I can just make that into a table right there (though it does not have parititionig).
Do I have to develop a separte process to presplit the one file into the multiple files sorted under the right partition? (The one file is very very big).
Create external table using fruitsbought.csv example (id is just example, ...- mean rest of columns in table):
CREATE EXTERNAL TABLE fruitsboughexternal
(
id INT,
.....
mydate STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'somelocationwithfruitsboughtfile/';
Create table with partition on date
CREATE TABLE fruitsbought(id INT, .....)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
INSERT INTO fruitsbought PARTITION(year, month, day) SELECT id, ..., year(mydate), month(mydate), day(mydate) FROM fruitsboughexternal;