Search code examples
impala

Is there a way to partition an existing text file with Impala without pre-splitting the files into the partitioned directories?


Say I have a single file "fruitsbought.csv" that contains many records that contain a date field.

Is it possible to "partition" for better performance by creating the "fruits" table based on that text file, while creating a partition in which all the rows in fruitsbought.txt that would match that partition, say if I wanted to do it by year and month, to be created?

Or do I have to as part of a separate process, create a directory for each year and then put the appropriate ".csv" files that are filtered down for that year into the directory structure on HDFS prior to creating the table in impala-shell?

I heard that you can create an empty table, set up partitions, then use "Insert" statements that happen to contain the partition that that record goes into. Though in my current case, I already have a single "fruitsbought.csv" that contains every record I want in it that I like how I can just make that into a table right there (though it does not have parititionig).

Do I have to develop a separte process to presplit the one file into the multiple files sorted under the right partition? (The one file is very very big).


Solution

    1. Create external table using fruitsbought.csv example (id is just example, ...- mean rest of columns in table):

      CREATE EXTERNAL TABLE fruitsboughexternal ( id INT, .....
      mydate STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'somelocationwithfruitsboughtfile/';

    2. Create table with partition on date

    CREATE TABLE fruitsbought(id INT, .....) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    1. Import data to fruitsbought table, partition parameters have to be last in select (of course mydate have to be in format understand by impala like 2014-06-20 06:05:25)

    INSERT INTO fruitsbought PARTITION(year, month, day) SELECT id, ..., year(mydate), month(mydate), day(mydate) FROM fruitsboughexternal;