I have a C# program that receives a Parquet file as a data stream and needs to output its schema as a dictionary, with the column name as key and the column type (as a string) as value. I am currently using Parquet.ParquetReader, but I can only access the physical type of each column, not its converted/logical type. My current code, which uses ParquetReader.Schema.Fields, is as follows:
public Dictionary<string, string> GetSchema(Stream stream)
{
    Ensure.ArgIsNotNull(stream, nameof(stream));
    // Maps column name -> type name; declared once, outside the try block.
    Dictionary<string, string> types = new Dictionary<string, string>();
    try
    {
        var parquetOptions = new ParquetOptions { TreatByteArrayAsString = true };
        var parquetFile = new ParquetReader(stream, parquetOptions);
        var schema = parquetFile.Schema;
        foreach (Parquet.Data.DataField field in schema.Fields)
        {
            var typeName = TranslateParquetType(field);
            types.Add(field.Name, typeName);
        }
    }
    catch (Exception ex)
    {
        throw new DataSourceReadException(ParquetDataFormat.Instance.Name, ex.MessageEx());
    }
    return types;
}
private string TranslateParquetType(Parquet.Data.DataField field)
{
    // Maps, structs and lists are all reported as "nested".
    if (field.SchemaType == Parquet.Data.SchemaType.Map ||
        field.SchemaType == Parquet.Data.SchemaType.Struct ||
        field.SchemaType == Parquet.Data.SchemaType.List)
    {
        return "nested";
    }
    switch (field.DataType)
    {
        case Parquet.Data.DataType.Int16:
            return "short";
        case Parquet.Data.DataType.Int32:
            return "int";
        case Parquet.Data.DataType.Int64:
            return "long";
        default:
            // Fall back to the enum name so that every code path returns a value.
            return field.DataType.ToString();
    }
}
I would like to be able to differentiate between an Int64 column that represents a number and an Int64 column that represents a timestamp. I know that this can be specified in the metadata of the Parquet file. However, ParquetReader.Schema only gives me access to the physical type of each column.
For example, I have a file whose schema looks like this (printed with parquet-cli; the converted/logical type is shown in parentheses):
required group schema {
  optional int32 Int32;
  optional int64 Int64;
  optional int64 Timestampms (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
With the code above, my dictionary gets Int64 as the type of the third column, although I would like it to be timestamp (in accordance with the logical/converted type shown in parentheses).
Is there some way to access the logical type of the parquet columns in C#, as opposed to only the physical type?
Found it! Parquet.ParquetReader has a property ThriftMetadata which has Schema. This is a list of SchemaElement objects, each with Type and ConvertedType and LogicalType.
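For anyone who wants to see what that looks like, here is a minimal sketch rather than a definitive implementation. It assumes a Parquet.Net version in which ParquetReader exposes ThriftMetadata as described above. The class and method names (ParquetLogicalTypes, Describe) are hypothetical, and the member names Type, ConvertedType and LogicalType are taken from the description above; Thrift-generated classes may spell them differently in your version (for example Converted_type), so adjust to whatever your build exposes.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Parquet;

// Hypothetical helper, not part of Parquet.Net.
public static class ParquetLogicalTypes
{
    // Returns column name -> "physical / converted / logical" description.
    public static Dictionary<string, string> Describe(Stream stream)
    {
        var types = new Dictionary<string, string>();
        using (var reader = new ParquetReader(stream))
        {
            // The Thrift schema is a flattened tree whose first element is the
            // root group ("schema" in the parquet-cli output), so skip it.
            foreach (var element in reader.ThriftMetadata.Schema.Skip(1))
            {
                // Assumption: member names as given in the description above;
                // they may differ between Parquet.Net versions.
                types[element.Name] =
                    $"physical={element.Type}, " +
                    $"converted={element.ConvertedType}, " +
                    $"logical={element.LogicalType}";
            }
        }
        return types;
    }
}
```

For a flat file like the one in the question, this gives one entry per column, and the Timestampms entry should carry the timestamp annotation while the plain Int64 column does not. Nested schemas also contain group elements in this list, so they would need extra handling before the result is usable as a column dictionary.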