MemSQL pipeline from S3 inserting NULLs into DATE type columns

The Memsql pipeline is supposed to dump data from S3 into a columnstore table. The source files are in ORC format. they are then converted to Parquet. The files have certain columns with DATE datatype (yyyy-mm-dd).

The pipeline runs fine but inserts NULL into all the Date type columns.

Solution

The DATE values may be getting written to Parquet as int64 with a timestamp logical type annotation (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp). MemSQL doesn't currently automatically convert these to a format compatible with e.g. DATETIME or TIMESTAMP, but rather attempts to assign to the destination column as if by assigning an integer literal with the raw underlying value. This gives NULL rather than an error for MySQL compatibility reasons, though set global data_conversion_compatibility_level="7.0" will make it an error.

You can investigate by temporarily giving the problem column TEXT type and looking at the resulting value. If it's an integer string, the issue is as described above and you can use the SET clause of CREATE PIPELINE to transform the value to a compatible format via something like CREATE PIPELINE P AS LOAD DATA .... INTO TABLE T(@col_tmp <- parquet_field_name) SET col = timestampadd(microsecond, @col_tmp, from_unixtime(0));.

The value will be a count of some time unit since the the unix epoch in some time zone. The unit and time zone depends on the writer, but should become clear if you know which time it's supposed to represent. Once you know that, modify the expression above to correct for units and perhaps call convert_tz as necessary.

Yes, it's a pain. We'll be making it automatic.