
Writing null values to a Parquet file with Parquet.Net creates an unreadable file


I'm using Parquet.Net (4.23.5) to write a Parquet file. I discovered that when I write a null value to a data column, the generated Parquet file is unreadable.

What am I doing wrong?

This is the simple code to test it:

var fields = new List<DataField>
{
    new DataField<int>("id"),
    new DataField<string?>("city")
};

var schema = new ParquetSchema(fields);

Parquet.Data.DataColumn[] columns = new Parquet.Data.DataColumn[2];
for (int i = 0; i < 2; i++)
{
    Type t = fields[i].ClrType;

    //var allData = getData(dataTable, i);
    columns[i] = t switch
    {
        Type when typeof(string) == t => new Parquet.Data.DataColumn(fields[i], new string?[] { "London", null}),/*"Derby" */
        Type when typeof(int) == t => new Parquet.Data.DataColumn(fields[i], new int[] { 1, 2 }),

        _ => throw new NotImplementedException(),
    };
}

using (Stream fileStream = System.IO.File.OpenWrite("c:\\test.parquet"))
{
    ParquetOptions parquetOptions = new ParquetOptions { TreatByteArrayAsString = true, UseDictionaryEncoding = true, UseDeltaBinaryPackedEncoding = false };

    using (ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, fileStream, parquetOptions))
    {
        parquetWriter.CompressionMethod = CompressionMethod.Gzip;
        parquetWriter.CompressionLevel = System.IO.Compression.CompressionLevel.Optimal;

        // create a new row group in the file
        using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup())
        {
            foreach (var item in columns)
            {
                await groupWriter.WriteColumnAsync(item);
            }
        }
    }
}

It creates the Parquet file, but when I try to open it with ParquetViewer, the file cannot be read.



Solution

  • Your error is caused by this setting in your ParquetOptions: UseDeltaBinaryPackedEncoding = false

    It seems the Parquet.Net library doesn't handle nullable values correctly when delta binary packed encoding is disabled. I reproduced this with the latest version of the library, 5.0.2, as well.

    If you can live with delta binary packed encoding, setting this flag back to its default value of true will resolve your error. Ultimately, though, I would recommend opening a ticket in the project's repo so the underlying issue gets addressed.

    Testing locally, with the flag set to true I am able to open the Parquet file without any issues.
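As a minimal sketch of the change described above, the following writes the same two columns while leaving UseDeltaBinaryPackedEncoding at its default, then reads the file back as a quick round-trip check. It assumes the Parquet.Net 4.x API (ParquetWriter.CreateAsync, ParquetReader.CreateAsync, OpenRowGroupReader, ReadColumnAsync) and a C# project with top-level statements. The relative path test.parquet and the variable names are illustrative, not taken from the question.

```csharp
using System;
using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Same two fields as in the question; "city" will hold a null value.
var idField = new DataField<int>("id");
var cityField = new DataField<string>("city");
var schema = new ParquetSchema(idField, cityField);

// UseDeltaBinaryPackedEncoding is deliberately NOT set here,
// so it keeps its default value (true, per the answer above).
var options = new ParquetOptions
{
    TreatByteArrayAsString = true,
    UseDictionaryEncoding = true
};

// Illustrative path (assumption); adjust as needed.
const string path = "test.parquet";

// File.Create truncates an existing file, whereas File.OpenWrite does not.
using (Stream output = File.Create(path))
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, output, options))
{
    writer.CompressionMethod = CompressionMethod.Gzip;

    using ParquetRowGroupWriter rowGroup = writer.CreateRowGroup();
    await rowGroup.WriteColumnAsync(new DataColumn(idField, new[] { 1, 2 }));
    await rowGroup.WriteColumnAsync(new DataColumn(cityField, new string?[] { "London", null }));
}

// Round-trip check: read every column of the first row group back.
using (Stream input = File.OpenRead(path))
using (ParquetReader reader = await ParquetReader.CreateAsync(input))
{
    ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(0);
    foreach (DataField field in reader.Schema.GetDataFields())
    {
        DataColumn column = await rowGroupReader.ReadColumnAsync(field);
        Console.WriteLine($"{field.Name}: {column.Data.Length} values");
    }
}
```

If the read-back step throws or reports an unexpected number of values, that would suggest the file is still malformed and worth attaching to a bug report.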