Impala Data types

I'm trying to understand the difference between the following data tyoes in Impala

String
Char
Varchar

Being schema on read what would be the need for 3 different types. I am wondering if there's any performance benefits of using Char/Varchars over Strings in scenarios where we know the upper bounds on column lengths?

Solution

STRING stores variable length data and is (essentially--barring some practical limitations, of course) unbounded.

VARCHAR(x) stores variable length data with an upper bound of x characters, so data will be truncated to the defined length. For example, if you have VARCHAR(10), your input data can have size in [0,10].

CHAR(x) is an x-character fixed-size data type. Data is padded if it is shorter than x. Data is truncated if it is longer than x.

Both VARCHAR and CHAR were introduced in Impala 2.0.0 (CDH 5.2.0) and are mostly useful for compatibility with other database systems. However, both VARCHAR and CHAR are not recommended except for some special use cases (with specific legacy systems) as both have some functional limitations.

While there should be similar performance between STRING and VARCHAR, CHAR has some different characteristics: notably it is not codegen'ed so typically performance will suffer. However, small CHARs (where x < 128) are stored along with the tuples during execution rather than in auxiliary memory as variable length data is.

The above provides some differences between these types, but it is recommended to use STRING whenever possible.

See the STRING, VARCHAR, and CHAR docs for more details.