Tags: c#, .net, sql-server, ado.net, sqlclient

Why does this approach with TarWriter use a lot of memory with Streams for files?


I originally posted this question: how-can-i-create-a-tar-file-in-memory-from-several-varbinarymax-columns-stored, which was helpful. But I realized we may need to create large .tar files, with files around 2GB each, which can result in ~20GB .tar files. A MemoryStream would not work for that, and I cannot allocate that much memory on the server.

I also looked at: how-do-you-pack-a-large-file-into-a-tar, where the answer implies that the method will not use a lot of memory; that approach writes to a tar file on disk.

So my modified approach is to read the varbinary columns from SQL Server and write them into a temporary .tar file on disk, which I then stream back to the calling client.

My current code:

var tempFileName = Path.GetTempFileName();
await using var tarFileStream = File.Create(tempFileName);

await using (var tarWriter = new TarWriter(tarFileStream, leaveOpen: true))
{
    await using var connection = new SqlConnection("YourConnectionString");
    const string query = @"
SELECT
  FileName,
  FileData
FROM FilesTable;
";
    await using var command = new SqlCommand(query, connection);
    await connection.OpenAsync();
    await using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);

    while (await reader.ReadAsync())
    {                            
        var fileName = reader.GetString(0);
        await using var fileDataStream = reader.GetStream(1);

        var entry = new PaxTarEntry(TarEntryType.RegularFile, fileName)
        {
            DataStream = fileDataStream,
        };
        await tarWriter.WriteEntryAsync(entry);
    }
}

tarFileStream.Position = 0;

// Stream tarFileStream to response body..
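
That last step is outside the scope of the question, but for completeness, a minimal sketch, assuming this runs in an ASP.NET Core endpoint with an HttpContext named context in scope (my assumption, not part of the original code):

// Sketch only: context is an assumed ASP.NET Core HttpContext.
// CopyToAsync moves the archive in small buffered chunks, so the
// multi-GB tar file never has to fit in memory.
context.Response.ContentType = "application/x-tar";
context.Response.ContentLength = tarFileStream.Length;
await tarFileStream.CopyToAsync(context.Response.Body);
// Delete tempFileName after tarFileStream has been disposed.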

When I try this approach, somewhere around 8GB is allocated on the LOH during the while-loop for a couple of 1.7GB files stored in SQL Server (FILESTREAM). Inspecting the memory usage in JetBrains dotMemory, the memory is allocated in an underlying MemoryStream inside TarWriter, if I'm not mistaken.

Am I missing something, or what else could cause the large memory consumption? According to Jon Skeet's answer I shouldn't be consuming this much memory.

EDIT 1: Tried reading the same file from disk instead of from SQL Server, with no problem (basically no memory consumption):

await using var tempFileStream = File.Create(Path.GetTempFileName());
await using (var tarWriter = new TarWriter(tempFileStream, leaveOpen: true))
{
    await using var file = File.Open("C:\\Users\\xyz\\Desktop\\BigFile.txt", FileMode.Open);
    await tarWriter.WriteEntryAsync(new PaxTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString())
    {
        DataStream = file
    });
}

So the problem must be related to setting the DataStream property when the stream comes from SqlClient.
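
A quick way to confirm the difference (my own check, using file from the EDIT 1 snippet and fileDataStream from the first snippet) is to compare CanSeek on the two streams:

// A FileStream is seekable; the stream returned by
// SqlDataReader.GetStream under CommandBehavior.SequentialAccess is not.
Console.WriteLine(file.CanSeek);           // True
Console.WriteLine(fileDataStream.CanSeek); // False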


EDIT 2: Reproducible example: read an 836MB text file and insert it into the database (without FILESTREAM), then read it back with GetStream and write the stream to TarWriter.

using System.Data;
using System.Formats.Tar;
using Microsoft.Data.SqlClient;

const string connectionString = "Integrated Security=true;Data Source=localhost;Initial Catalog=MyTestDatabase;" +
                                "User Id=username;Password=xy;Max pool size=200;Min pool size=10;Connection Timeout=30;" +
                                "Encrypt=false";

await using var connection = new SqlConnection(connectionString);
await using var largeFile = File.Open(@"C:\Users\xyz\Desktop\BigFile.txt", FileMode.Open);

await using var createAndInsertCommand = new SqlCommand(
    """
    CREATE TABLE [dbo].[Temp] ([Id] INT NOT NULL, [Data] VARBINARY (MAX) NOT NULL);
    INSERT INTO dbo.Temp VALUES (@Id, @Data);
    """, 
    connection);
createAndInsertCommand.Parameters.Add("@Id", SqlDbType.Int).Value = 1;
createAndInsertCommand.Parameters.Add("@Data", SqlDbType.VarBinary, size: -1).Value = largeFile;
await createAndInsertCommand.Connection.OpenAsync();
await createAndInsertCommand.ExecuteNonQueryAsync();

// Read from database
await using var command = new SqlCommand("SELECT TOP 1 Data FROM dbo.Temp", connection);
await using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);

if (await reader.ReadAsync())
{
    await using var fileData = reader.GetStream(0);
    
    await using var tempFileStream = File.Create(Path.GetTempFileName());
    await using var tarWriter = new TarWriter(tempFileStream, leaveOpen: true, format: TarEntryFormat.Pax);


    tarWriter.WriteEntry(new PaxTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString())
    {
        DataStream = fileData
    });
}



Solution

  • The Stream object returned by SQL Server isn't seekable, which makes sense. TarWriter treats unseekable streams differently, though, depending on the TAR format.

    Looking at the TarHeader.Write.cs code for PAX, all of the data is buffered in a MemoryStream when the input stream isn't seekable. The Ustar format doesn't buffer.

    Try changing the format to Ustar and using a UstarTarEntry:

    await using var tarWriter = new TarWriter(tempFileStream, 
        leaveOpen: true,
        format: TarEntryFormat.Ustar);
    
    
    tarWriter.WriteEntry(new UstarTarEntry(TarEntryType.RegularFile, Guid.NewGuid().ToString())
    {
        DataStream = fileData
    });