Tags: c#, avro, snappy, spark-avro

How do I access the data in an Avro.snz file with C#?


I have an Avro.snz file whose avro.codec is snappy. It can be opened with com.databricks.avro in Spark, but snappy seems to be unsupported by Apache.Avro and Confluent.Avro, which only offer deflate and null. Although they can give me the schema, I cannot get at the data.

The following code throws an error on reader.Next(). IronSnappy is unable to decompress the file either; it reports that the input is corrupt.

using (Avro.File.IFileReader<generic> reader = Avro.File.DataFileReader<generic>.OpenReader(avro_path))
{
    schema = reader.GetSchema();
    Console.WriteLine(reader.HasNext()); //true
    var hi = reader.Next(); // error
    Console.WriteLine(hi.ElementAt(0).ToString()); // error
}

I'm starting to wonder if there is anything in the Azure HDInsight library, but I can't seem to find a NuGet package that gives me a way to read Avro with support for Snappy compression.

I'm open to any solution, even if that means downloading the source for Apache.Avro and adding Snappy support manually, but to be honest, I'm something of a newbie and have no idea how compression even works, let alone how to add support for it to a library.

Can anyone help?

Update: Simply adding a snappy codec to Apache.Avro and swapping the DeflateStream for an IronSnappy stream failed; it gave "Corrupt input" again. Is there anything that can open Snappy-compressed Avro files with C#?

Or how do I determine which part of the Avro file is Snappy-compressed, so that I can pass just that part to IronSnappy?


Solution

  • The simplest solution would be to use AvroConvert:

    // avroObject is a byte[] holding the full contents of the Avro file
    ResultModel resultObject = AvroConvert.Deserialize<ResultModel>(avroObject);

    From https://github.com/AdrianStrugala/AvroConvert

    The following codecs are supported:

    • null
    • deflate
    • snappy
    • gzip
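
On the follow-up question of which part of the file is actually Snappy-compressed: as far as the Avro specification describes it, an object container file is a header (the magic bytes `Obj` plus 0x01, a metadata map that includes avro.schema and avro.codec, and a 16-byte sync marker) followed by data blocks. Each block is an object count, a byte size, the block data, and the sync marker again. With the snappy codec, only the block data is compressed, in the raw Snappy block format, and its last 4 bytes are a big-endian CRC32 of the uncompressed data. Feeding the whole file, or a block with the CRC still attached, to a decompressor would plausibly produce a "corrupt input" error. The sketch below uses only the standard library and stops short of decompressing; the names `AvroContainer`, `ReadHeader`, `ReadBlocks` and `SplitSnappyBlock` are hypothetical helpers for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: walks the container layout, does not decompress anything.
public static class AvroContainer
{
    // Avro longs are variable-length and zigzag-encoded.
    public static long ReadLong(Stream s)
    {
        ulong raw = 0;
        int shift = 0;
        while (true)
        {
            int b = s.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            raw |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return (long)(raw >> 1) ^ -(long)(raw & 1);
    }

    static byte[] ReadExact(Stream s, int count)
    {
        var buf = new byte[count];
        int off = 0;
        while (off < count)
        {
            int n = s.Read(buf, off, count - off);
            if (n <= 0) throw new EndOfStreamException();
            off += n;
        }
        return buf;
    }

    // Reads the magic bytes, the metadata map and the 16-byte sync marker.
    public static Dictionary<string, byte[]> ReadHeader(Stream s, out byte[] sync)
    {
        byte[] magic = ReadExact(s, 4);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1)
            throw new InvalidDataException("Not an Avro container file.");

        var meta = new Dictionary<string, byte[]>();
        for (long n = ReadLong(s); n != 0; n = ReadLong(s))
        {
            // A negative count is followed by the byte size of the map block.
            if (n < 0) { n = -n; ReadLong(s); }
            for (long i = 0; i < n; i++)
            {
                string key = Encoding.UTF8.GetString(ReadExact(s, (int)ReadLong(s)));
                meta[key] = ReadExact(s, (int)ReadLong(s));
            }
        }
        sync = ReadExact(s, 16);
        return meta;
    }

    // Yields each data block (still compressed); requires a seekable stream.
    public static IEnumerable<(long Count, byte[] Data)> ReadBlocks(Stream s, byte[] sync)
    {
        while (s.Position < s.Length)
        {
            long count = ReadLong(s);
            int size = (int)ReadLong(s);
            byte[] data = ReadExact(s, size);
            byte[] marker = ReadExact(s, 16);
            if (!marker.SequenceEqual(sync))
                throw new InvalidDataException("Sync marker mismatch.");
            yield return (count, data);
        }
    }

    // For the snappy codec: splits a block into the raw Snappy payload and
    // the trailing big-endian CRC32 of the uncompressed data.
    public static (byte[] Compressed, uint Crc32) SplitSnappyBlock(byte[] block)
    {
        if (block.Length < 4) throw new InvalidDataException("Block too short.");
        int n = block.Length - 4;
        var compressed = new byte[n];
        Array.Copy(block, compressed, n);
        uint crc = (uint)(block[n] << 24 | block[n + 1] << 16 | block[n + 2] << 8 | block[n + 3]);
        return (compressed, crc);
    }
}
```

To use it, open the file with `File.OpenRead(avro_path)`, call `ReadHeader`, check that `Encoding.UTF8.GetString(meta["avro.codec"])` is `snappy`, then pass each block from `ReadBlocks` through `SplitSnappyBlock` and hand only the `Compressed` part to a Snappy decoder that accepts the raw block format (not the framed stream format). The decompressed bytes are `Count` Avro-encoded records back to back, which still need to be decoded against the schema.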