As I understand it, in Impala you can set the replication factor at the end of the CREATE TABLE statement:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
However, I'm a bit puzzled about what pool_name refers to. Is this the HDFS path where the data is stored?
Not exactly. It refers to an HDFS cache pool defined with the hdfs cacheadmin -addPool ... command (see the HDFS commands guide). A pool, in turn, contains a set of cache directives that reference the HDFS paths to be cached. From the Apache documentation:
A cache pool is an administrative entity used to manage groups of cache directives. Cache pools have UNIX-like permissions, which restrict which users and groups have access to the pool. Write permissions allow users to add and remove cache directives to the pool. Read permissions allow users to list the cache directives in a pool, as well as additional metadata. Execute permissions are unused.
Cache pools are also used for resource management. Pools can enforce a maximum limit, which restricts the number of bytes that can be cached in aggregate by directives in the pool. Normally, the sum of the pool limits will approximately equal the amount of aggregate memory reserved for HDFS caching on the cluster. Cache pools also track a number of statistics to help cluster users determine what is and should be cached.
Pools also can enforce a maximum time-to-live. This restricts the maximum expiration time of directives being added to the pool.
The details of how to use this HDFS feature in Impala can be found in the Impala Guide.
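As a rough sketch of how the pieces fit together: an administrator creates the cache pool first, and the Impala statement then refers to it by name. The pool name impala_pool, the 4 GB limit, the impala owner, and the census table are all made up for illustration and are not from the question. The script only prints the commands by default, so it is safe to run on a machine without a cluster; run it with RUN= (empty) to actually execute them.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Invoke with RUN= (empty) in the environment to run the commands for real.
RUN="${RUN-echo}"

# Hypothetical pool name, chosen only for this example.
POOL="impala_pool"

# 1. Create the HDFS cache pool. The limit is given in bytes; making the
#    impala user the owner (an assumption about the setup) lets Impala
#    add cache directives to the pool.
$RUN hdfs cacheadmin -addPool "$POOL" -owner impala -limit 4000000000

# 2. Check that the pool exists and inspect its statistics.
$RUN hdfs cacheadmin -listPools -stats "$POOL"

# 3. Create a table whose data is cached in that pool. CACHED IN takes the
#    pool NAME, not an HDFS path. The table definition is hypothetical.
$RUN impala-shell -q "CREATE TABLE census (name STRING, zip STRING) CACHED IN '$POOL' WITH REPLICATION = 3"

# 4. List the cache directives that now belong to the pool.
$RUN hdfs cacheadmin -listDirectives -pool "$POOL"
```

Note that WITH REPLICATION here controls on how many hosts each cached block is held in memory; it is separate from the ordinary HDFS block replication factor.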