I have a gzip file stored as a blob in my Azure storage. The file contains multiple lines, and I need to read it line by line and process it.
I can already write it to a local text file uncompressed, but I do not want to add that extra step.
You can use the C# code below to read the compressed file directly from Azure storage, decompress it, and process it line by line, without first writing the uncompressed file to disk.
Code:
using Azure.Storage.Blobs;
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main(string[] args)
    {
        // Replace the placeholder with your storage account connection string.
        string connectionString = "xxxxx";
        string containerName = "test";
        string blobName = "data.gz";

        BlobServiceClient blobServiceClient = new BlobServiceClient(connectionString);
        BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(containerName);
        BlobClient blobClient = containerClient.GetBlobClient(blobName);

        // Chain the streams: blob -> gzip decompression -> text reader.
        // Nothing is written to disk; data is decompressed as it is read.
        using (Stream blobStream = await blobClient.OpenReadAsync())
        using (GZipStream decompressionStream = new GZipStream(blobStream, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(decompressionStream, Encoding.Default))
        {
            string line;
            // ReadLineAsync returns null once the end of the stream is reached.
            while ((line = await reader.ReadLineAsync()) != null)
            {
                // Process the line here
                Console.WriteLine(line);
            }
        }
    }
}
The code above retrieves the gzip-compressed blob from your Azure storage account, opens the blob stream with OpenReadAsync, decompresses the stream with GZipStream, and reads the contents of the blob line by line using a StreamReader.
Output:
# STOCKHOLM 1.0
#=GF ID neur_chan
#=GF AC PF00065
#=GF KL This family has been killed
#=GF FW PF02931
#=GF CC This family has been killed and split into two, one family
#=GF CC for the extracellular ligand binding domain and one for the
#=GF CC transmembrane region
//
# STOCKHOLM 1.0
#=GF ID zn-protease
#=GF AC PF00099
#=GF KL This family has been killed
#=GF FW
#=GF CC This family was removed from Pfam so that more extensive
#=GF CC alignments could be built for each subfamily.
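
If you want to try the decompress-and-read-lines chain without a storage account, the same GZipStream and StreamReader combination can be exercised against an in-memory stream. The following is only a minimal sketch: GzipLineReader, ReadLines, and Compress are hypothetical names introduced here for illustration (they are not part of the Azure SDK), and UTF-8 text is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper for illustration; not part of the Azure SDK.
public static class GzipLineReader
{
    // Lazily yields decompressed lines from any readable gzip stream.
    // Assumes the compressed content is UTF-8 text.
    public static IEnumerable<string> ReadLines(Stream gzipStream)
    {
        using (var gzip = new GZipStream(gzipStream, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Builds a small gzip payload in memory so ReadLines can be tested offline.
    public static MemoryStream Compress(string text)
    {
        var buffer = new MemoryStream();
        // leaveOpen keeps the MemoryStream usable after the GZipStream is disposed.
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            gzip.Write(bytes, 0, bytes.Length);
        }
        buffer.Position = 0; // rewind so the reader starts at the gzip header
        return buffer;
    }
}
```

For a quick local check, iterate over GzipLineReader.ReadLines(GzipLineReader.Compress("first\nsecond\n")) with foreach. Because ReadLines accepts any Stream, the stream returned by blobClient.OpenReadAsync() should be usable in place of the in-memory one.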