
Parquet file logical type mapping


In Parquet files, data is stored in a small number of primitive types. There is, however, the concept of higher-order logical types (also called converted types). For example, a DECIMAL(10,2) may be stored as a fixed-length byte array of 5 bytes (the minimum that can hold 10 decimal digits), i.e., an integer, where the division by 100 needed to recover the fixed-precision decimal is defined by the schema.
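To make the DECIMAL example concrete, here is a minimal sketch of how such a byte array could be decoded. It assumes the bytes hold the unscaled value as a big-endian two's-complement integer, which is how the Parquet format describes DECIMAL in byte arrays. The function name `decode_decimal` is hypothetical, and the byte width is whatever the schema's `type_length` says (5 bytes here).

```python
from decimal import Decimal


def decode_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet-style DECIMAL stored as a byte array.

    Assumes `raw` is a big-endian two's-complement unscaled integer
    and `scale` is the number of digits after the decimal point.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does
    # not depend on the current decimal context precision.
    return Decimal((sign, digits, -scale))


# 123.45 as DECIMAL(10,2): unscaled value 12345 in a 5-byte array
raw = (12345).to_bytes(5, byteorder="big", signed=True)
print(decode_decimal(raw, scale=2))  # 123.45
```

The integer on disk carries no scale of its own; the same bytes decode to 1234.5 with scale 1, which is why the schema has to record it.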

My question is this: where is the map from the numeric logical type to identifiers such as DECIMAL, and how are these types further specified? As far as I understand, the schema Thrift spec block looks like this:

    thrift_spec = (0, type(I32), type_length(I32), repetition_type(I32), name(string), num_children(I32), converted_type(I32), ... )

It is the meaning of the last field, converted_type, that I am after, along with what further information may follow it in the spec.


Solution

  • A brief description is given in the Thrift definition linked below, so I was right about DECIMALs. How exactly the other types are used remains somewhat opaque.

    https://github.com/Parquet/parquet-format/blob/master/src/thrift/parquet.thrift#L65

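    Specifically, for DECIMAL the stored integer is divided by 10**scale to recover the value, where scale is the next 32-bit integer in the spec block, directly after converted_type. It is followed by precision, also a 32-bit integer.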
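As a rough sketch of the mapping the question asks about, the numeric codes can be mirrored in a Python `IntEnum`. The values below are transcribed from the `ConvertedType` enum in parquet.thrift as I understand it, so check them against the linked file before relying on them. The helper `describe_converted_type` is a hypothetical name, not part of any Parquet library.

```python
from enum import IntEnum
from typing import Optional


class ConvertedType(IntEnum):
    """Numeric converted_type codes (assumed to match parquet.thrift)."""
    UTF8 = 0
    MAP = 1
    MAP_KEY_VALUE = 2
    LIST = 3
    ENUM = 4
    DECIMAL = 5
    DATE = 6
    TIME_MILLIS = 7
    TIME_MICROS = 8
    TIMESTAMP_MILLIS = 9
    TIMESTAMP_MICROS = 10
    UINT_8 = 11
    UINT_16 = 12
    UINT_32 = 13
    UINT_64 = 14
    INT_8 = 15
    INT_16 = 16
    INT_32 = 17
    INT_64 = 18
    JSON = 19
    BSON = 20
    INTERVAL = 21


def describe_converted_type(
    converted_type: Optional[int],
    scale: Optional[int] = None,
    precision: Optional[int] = None,
) -> str:
    """Turn the raw converted_type integer into a readable label."""
    if converted_type is None:
        return "none (plain primitive type)"
    try:
        kind = ConvertedType(converted_type)
    except ValueError:
        return f"unknown ({converted_type})"
    if kind is ConvertedType.DECIMAL:
        # scale and precision are the two I32 fields after converted_type
        return f"DECIMAL({precision},{scale})"
    return kind.name


print(describe_converted_type(5, scale=2, precision=10))  # DECIMAL(10,2)
print(describe_converted_type(0))  # UTF8
```

Only DECIMAL needs the extra scale and precision fields; the other converted types are fully described by the code together with the primitive type.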