java, arraylist, impala

Insert into Impala table vs write to HDFS


I have about 10,000 records (stored in an ArrayList in Java) that I want to insert into Impala.

Should I use INSERT INTO ... PARTITION ... VALUES to insert them directly into Impala? (I am not sure how many records can be inserted in one SQL statement.)

Or should I write the records to HDFS and then alter the Impala table?

Which way is preferred? Or are there any other solutions?

Also, if I do this every 5 minutes, how can I avoid accumulating many small files in one partition (the table is partitioned by hour)? This would produce 12 small files in each partition. Will that affect query speed?


Solution

  • The best approach is:

    1. Create your table in Impala as an external table associated with an HDFS path
    2. Write the data directly to HDFS, daily if possible; hourly batches are probably too small
    3. Run INVALIDATE METADATA table_name so that the new data is visible (REFRESH table_name is the lighter option when files were only added to partitions Impala already knows about)
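The steps above can be sketched in Java roughly as follows. This is only a minimal sketch under several assumptions: the class and method names, the `dt=yyyy-MM-dd` daily partition layout, and the comma-delimited text format are all hypothetical and not from the original answer. It stages the batch as a local file and only builds the SQL strings; copying the file into the table's HDFS directory (for example with `hdfs dfs -put`) and running the statements through an Impala connection are left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class ImpalaBatchSketch {

    // Assumed layout: one directory per day under the external table's location.
    static String partitionDir(String tableDir, LocalDateTime ts) {
        return tableDir + "/dt=" + ts.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Turn each record into one delimited text line.
    // Note: fields that contain the separator are not escaped here.
    static List<String> toDelimitedLines(List<String[]> records, char sep) {
        List<String> lines = new ArrayList<>(records.size());
        for (String[] record : records) {
            lines.add(String.join(String.valueOf(sep), record));
        }
        return lines;
    }

    // Write the whole batch as a single file, so one batch yields one file.
    static Path writeBatch(Path dir, String fileName, List<String> lines) throws IOException {
        Files.createDirectories(dir);
        return Files.write(dir.resolve(fileName), lines, StandardCharsets.UTF_8);
    }

    // Statement that registers the day's partition with the table.
    static String addPartitionSql(String table, LocalDateTime ts) {
        return "ALTER TABLE " + table + " ADD IF NOT EXISTS PARTITION (dt='"
                + ts.format(DateTimeFormatter.ISO_LOCAL_DATE) + "')";
    }

    // Statement from step 3: makes the newly written data visible to Impala.
    static String invalidateMetadataSql(String table) {
        return "INVALIDATE METADATA " + table;
    }

    public static void main(String[] args) throws IOException {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"1", "alice"});
        records.add(new String[] {"2", "bob"});

        LocalDateTime now = LocalDateTime.now();
        // Local staging directory standing in for the HDFS table location.
        Path staging = Files.createTempDirectory("impala-staging");
        Path dayDir = staging.resolve(partitionDir("events", now));
        Path file = writeBatch(dayDir, "batch-0001.csv", toDelimitedLines(records, ','));

        System.out.println("Staged file: " + file);
        System.out.println(addPartitionSql("events", now));
        System.out.println(invalidateMetadataSql("events"));
    }
}
```

Buffering records in memory and writing one file per day (or per larger interval) rather than one every 5 minutes is what keeps the number of small files per partition down.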

    I hope this answer helps.

    Regards!