c# .net parquet

C# Parquet file schema: reading logical/converted types


I have a C# program that receives a Parquet file as a data stream and needs to output its schema as a dictionary, with the column name as key and the column type (as a string) as value. I am currently using Parquet.ParquetReader, but I can only seem to access the physical type of each column, not its converted/logical type. My current code, which uses ParquetReader.Schema.Fields, is as follows:

        public Dictionary<string, string> GetSchema(Stream stream)
        {
            Ensure.ArgIsNotNull(stream, nameof(stream));
            // Declared once, outside the try block, so it can be returned at the end.
            var types = new Dictionary<string, string>();

            try
            {
                var parquetOptions = new ParquetOptions { TreatByteArrayAsString = true };
                var parquetFile = new ParquetReader(stream, parquetOptions);
                var schema = parquetFile.Schema;

                // Map each column name to a string describing its type.
                foreach (Parquet.Data.DataField field in schema.Fields)
                {
                    var typeName = TranslateParquetType(field);
                    types.Add(field.Name, typeName);
                }
            }
            catch (Exception ex)
            {
                throw new DataSourceReadException(ParquetDataFormat.Instance.Name, ex.MessageEx());
            }

            return types;
        }

        private string TranslateParquetType(Parquet.Data.DataField field)
        {
            // Nested columns are reported as a single category.
            if (field.SchemaType == Parquet.Data.SchemaType.Map ||
                field.SchemaType == Parquet.Data.SchemaType.Struct ||
                field.SchemaType == Parquet.Data.SchemaType.List)
            {
                return "nested";
            }

            switch (field.DataType)
            {
                case Parquet.Data.DataType.Int16:
                    return "short";
                case Parquet.Data.DataType.Int32:
                    return "int";
                case Parquet.Data.DataType.Int64:
                    return "long";
                default:
                    // Fall back to the enum name so every code path returns a value.
                    return field.DataType.ToString();
            }
        }

I would like to be able to differentiate between an Int64 column that represents a number and an Int64 column that represents a timestamp. I know that this can be specified in the metadata of the Parquet file. However, ParquetReader.Schema only gives me access to the physical type of each column.

For example, I have a file whose schema looks like this (printed with parquet-cli; the converted/logical type is shown in parentheses):

required group schema {
  optional int32 Int32;
  optional int64 Int64;
  optional int64 Timestampms (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}

With my current code, the dictionary records Int64 as the type of the third column, whereas I would like it to record timestamp (in accordance with the logical/converted type shown in parentheses).

Is there some way to access the logical type of Parquet columns in C#, as opposed to only the physical type?


Solution

  • Found it! Parquet.ParquetReader has a ThriftMetadata property, which in turn exposes Schema. This is a list of SchemaElement objects, each carrying Type, ConvertedType, and LogicalType.
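
To make the answer concrete, here is a minimal sketch of the translation step once those three values are available for each column. It does not call Parquet.Net: `PhysicalType`, `ColumnInfo`, `SchemaTranslator`, and the string annotations are hypothetical stand-ins that only mirror the Type, ConvertedType, and LogicalType fields named above, and the real SchemaElement member names and types may differ between library versions. The idea is to prefer the logical type, fall back to the legacy converted type, and use the physical type only when neither is set.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the physical types a Parquet column can have.
public enum PhysicalType { Boolean, Int32, Int64, Int96, Float, Double, ByteArray, FixedLenByteArray }

// Hypothetical stand-in for one schema element. In the real program these
// values would be copied from the entries of ThriftMetadata.Schema.
public sealed class ColumnInfo
{
    public string Name { get; set; }
    public PhysicalType Type { get; set; }
    public string ConvertedType { get; set; } // null when unset, e.g. "TIMESTAMP_MICROS"
    public string LogicalType { get; set; }   // null when unset, e.g. "TIMESTAMP"
}

public static class SchemaTranslator
{
    public static string Translate(ColumnInfo column)
    {
        // 1. Prefer the logical type annotation when the writer set one.
        if (!string.IsNullOrEmpty(column.LogicalType))
            return column.LogicalType.ToLowerInvariant();

        // 2. Otherwise fall back to the legacy converted type.
        if (!string.IsNullOrEmpty(column.ConvertedType))
        {
            // Legacy names such as TIMESTAMP_MILLIS and TIMESTAMP_MICROS
            // carry the unit in the name; collapse them to one label.
            return column.ConvertedType.StartsWith("TIMESTAMP", StringComparison.OrdinalIgnoreCase)
                ? "timestamp"
                : column.ConvertedType.ToLowerInvariant();
        }

        // 3. No annotation at all: report the physical type.
        switch (column.Type)
        {
            case PhysicalType.Int32: return "int";
            case PhysicalType.Int64: return "long";
            default: return column.Type.ToString().ToLowerInvariant();
        }
    }

    public static Dictionary<string, string> BuildSchema(IEnumerable<ColumnInfo> columns)
    {
        var types = new Dictionary<string, string>();
        foreach (var column in columns)
            types[column.Name] = Translate(column);
        return types;
    }
}
```

For the example file in the question, passing in Int32 and Int64 with no annotation and Timestampms as an Int64 with logical type "TIMESTAMP" would yield int, long, and timestamp respectively. Note that in the Parquet format the flattened schema list starts with the root group element, so that first entry would likely need to be skipped when filling the dictionary from a real file.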