c# azure-functions azure-blob-storage parquet parquet.net

Writing Parquet files using Parquet.NET works with local file, but results in empty file in blob storage


We are using Parquet.NET to write Parquet files. I've set up a simple schema containing three columns and two rows:

        // Set up the file structure
        var UserKey = new Parquet.Data.DataColumn(
            new DataField<Int32>("UserKey"),
            new Int32[] { 1234, 12345}
        );

        var AADID = new Parquet.Data.DataColumn(
            new DataField<string>("AADID"),
            new string[] { Guid.NewGuid().ToString(), Guid.NewGuid().ToString() }
        );

        var UserLocale = new Parquet.Data.DataColumn(
            new DataField<string>("UserLocale"),
            new string[] { "en-US", "en-US" }
        );

        var schema = new Schema(UserKey.Field, AADID.Field, UserLocale.Field);

When I use a FileStream to write to a local file, the file is created, and once the code finishes I can see two rows in it (the file is 1 KB afterwards):

            using (Stream fileStream = System.IO.File.OpenWrite("C:\\Temp\\Users.parquet")) {
                using (var parquetWriter = new ParquetWriter(schema, fileStream)) {
                    // Create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
                        groupWriter.WriteColumn(UserKey);
                        groupWriter.WriteColumn(AADID);
                        groupWriter.WriteColumn(UserLocale);
                    }
                }
            }

Yet when I use the same approach to write to our blob storage, only an empty file is created and the data is missing:

// Open reference to Blob Container
CloudAppendBlob blob = OpenBlobFile(blobEndPoint, fileName);

using (MemoryStream stream = new MemoryStream()) {

    blob.CreateOrReplaceAsync();

    using (var parquetWriter = new ParquetWriter(schema, stream)) {
        // Create a new row group in the file
        using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
            groupWriter.WriteColumn(UserKey);
            groupWriter.WriteColumn(AADID);
            groupWriter.WriteColumn(UserLocale);
        }
    }

    // Set stream position to 0
    stream.Position = 0;
    blob.AppendBlockAsync(stream);
    return true;
}

...

public static CloudAppendBlob OpenBlobFile (string blobEndPoint, string fileName) {
    CloudBlobContainer container = new CloudBlobContainer(new System.Uri(blobEndPoint));
    CloudAppendBlob blob = container.GetAppendBlobReference(fileName);

    return blob;
}

Reading the documentation, I would expect my call to blob.AppendBlockAsync to do the trick, yet I end up with an empty file. Does anyone have suggestions as to why this happens and how I can resolve it, so that the data actually ends up in the file?

Thanks in advance.


Solution

  • The explanation for the file ending up empty is the line:

    blob.AppendBlockAsync(stream);
    

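    Note that the method has the Async suffix. It returns a Task as soon as the operation has started, and the caller is expected to await that task. Without the await, the enclosing method carried on and returned before the upload had finished. I made the enclosing method async, and Visual Studio suggested the following change to the line: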

    _ = await blob.AppendBlockAsync(stream);
    

    The _ is a C# discard: AppendBlockAsync returns a long (which is what hovering over it shows), and assigning it to _ states that the value is intentionally ignored. With this change, the code works as intended.
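
To illustrate the difference the await makes, here is a minimal, self-contained sketch that needs no storage account. FakeAppendAsync is a hypothetical stand-in for blob.AppendBlockAsync(stream), not part of any Azure SDK; it simply copies the stream into an in-memory target after a short delay. The class and method names are made up for this example.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AwaitDemo
{
    // Hypothetical stand-in for blob.AppendBlockAsync(stream): after a
    // simulated network delay it copies 'source' into 'target' and
    // returns the number of bytes now held by 'target'.
    public static async Task<long> FakeAppendAsync(Stream source, MemoryStream target)
    {
        await Task.Delay(100);
        await source.CopyToAsync(target);
        return target.Length;
    }

    // Mirrors the question: the task is started but never awaited, so the
    // method returns (and disposes the stream) before any data is copied.
    public static long UploadWithoutAwait(MemoryStream target)
    {
        using (var stream = new MemoryStream(new byte[] { 1, 2, 3 }))
        {
            FakeAppendAsync(stream, target); // returns an unfinished Task
            return target.Length;            // nothing has been copied yet
        }
    }

    // Mirrors the fix: the method is async and awaits the upload, so the
    // stream stays alive until the copy has completed.
    public static async Task<long> UploadWithAwait(MemoryStream target)
    {
        using (var stream = new MemoryStream(new byte[] { 1, 2, 3 }))
        {
            _ = await FakeAppendAsync(stream, target); // discard the returned long
            return target.Length;
        }
    }

    public static async Task Main()
    {
        var first = new MemoryStream();
        Console.WriteLine(UploadWithoutAwait(first));     // prints 0

        var second = new MemoryStream();
        Console.WriteLine(await UploadWithAwait(second)); // prints 3
    }
}
```

In the question's code, blob.CreateOrReplaceAsync() is not awaited either, so it would presumably need an await as well, so that the blob exists before a block is appended to it.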