I'm trying to understand the difference between the following data tyoes in Impala
Being schema on read what would be the need for 3 different types. I am wondering if there's any performance benefits of using Char/Varchars over Strings in scenarios where we know the upper bounds on column lengths?
STRING
stores variable length data and is (essentially--barring some practical limitations, of course) unbounded.
VARCHAR(x)
stores variable length data with an upper bound of x characters, so data will be truncated to the defined length. For example, if you have VARCHAR(10)
, your input data can have size in [0,10].
CHAR(x)
is an x-character fixed-size data type. Data is padded if it is shorter than x. Data is truncated if it is longer than x.
Both VARCHAR
and CHAR
were introduced in Impala 2.0.0 (CDH 5.2.0) and are mostly useful for compatibility with other database systems. However, both VARCHAR
and CHAR
are not recommended except for some special use cases (with specific legacy systems) as both have some functional limitations.
While there should be similar performance between STRING
and VARCHAR
, CHAR
has some different characteristics: notably it is not codegen'ed so typically performance will suffer. However, small CHAR
s (where x < 128
) are stored along with the tuples during execution rather than in auxiliary memory as variable length data is.
The above provides some differences between these types, but it is recommended to use STRING
whenever possible.