Tags: c#, arrays, sonarqube, buffer, filestream

Accessing the underlying array of a Memory<T>


In my application I need to iterate over the contents of a file to compute hashes of fixed-size chunks of the file. The end goal is to implement Amazon Glacier's Tree Hash algorithm; I copied the code from their documentation almost verbatim.

The problem appears when I run the following code through SonarQube, though:

    byte[] buff = new byte[Mio];
    int bytesRead;

    while ((bytesRead = await inputStream.ReadAsync(buff, 0, Mio)) > 0) {
        // Process the bytes read
    }

I have a Roslyn issue on the line with the while loop. The issue is "Change the 'ReadAsync' method call to use the 'Stream.ReadAsync(Memory, CancellationToken)' overload". According to the description, it is because methods using the Memory class are more efficient than those using basic arrays.

That may be true when Memory<T> can be used end to end. The thing is, I need to feed the data to the ComputeHash method of a HashAlgorithm, and it has no overload accepting a Memory. This means I must use the Memory's ToArray method, which makes a copy of the data. That doesn't sound very efficient to me.
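
To illustrate (a rough sketch rather than my actual code; `hashAlgorithm` stands for whatever HashAlgorithm instance I end up using), this is the extra copy I'm trying to avoid:

    Memory<byte> memory = new byte[Mio];
    int bytesRead = await inputStream.ReadAsync(memory);

    // ToArray() allocates a new array and copies every chunk before it can be hashed
    byte[] copy = memory[..bytesRead].ToArray();
    byte[] hash = hashAlgorithm.ComputeHash(copy);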

I know it is possible to create a Memory instance by passing an existing array to its constructor, like this:

    byte[] buff = new byte[Mio];
    Memory<byte> memory = new Memory<byte>(buff);
    int bytesRead;

    while ((bytesRead = await inputStream.ReadAsync(memory)) > 0) {
        // Use `buff` to access the bytes
    }

However, the documentation is unclear about whether the array passed to the constructor is actually used as the underlying storage for the Memory instance.
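
As a quick sanity check, this sketch shows the behaviour I would expect if the array really is the underlying storage (I haven't been able to confirm it, hence the question):

    byte[] buff = new byte[4];
    Memory<byte> memory = new Memory<byte>(buff);

    memory.Span[0] = 42;          // write through the Memory's Span
    Console.WriteLine(buff[0]);   // prints 42 only if buff is the backing store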

Thus, here are my questions:

  • How can I directly feed the data from a Memory to a HashAlgorithm instance? I'm talking about any class that derives from HashAlgorithm, not specifically SHA256; unlike the Glacier example, my implementation is not restricted to SHA256.
  • Is the data stored in the Memory instance also accessible in the array used to create it?
  • Is there another way to access the data stored in the Memory instance as an array, without copy?
  • Failing that, how can I silence an external issue in SonarQube (a Roslyn warning in this case)? I don't have the dropdown to change its status like I do for normal Sonar issues.
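
For reference, here is the kind of in-source suppression I have in mind (just a sketch; I believe the underlying Roslyn rule is CA1835, but I'm not certain that's the right ID):

    #pragma warning disable CA1835 // prefer the Memory-based ReadAsync overload
    while ((bytesRead = await inputStream.ReadAsync(buff, 0, Mio)) > 0) {
        // Process the bytes read
    }
    #pragma warning restore CA1835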

EDIT: Adding additional info on how the code works. It's the first part of AWS's example of computing a Glacier Tree Hash, the part that computes the initial hashes of the 1 Mio blocks of the file.

Here is the relevant code, including the while loop above:


// Constructor of the class
// The class implements IDisposable to properly dispose of the Algorithm field
// Constructor is called like this
// `using TreeHash treeHash = new TreeHash(System.Security.Cryptography.SHA512.Create());`
public TreeHash(HashAlgorithm algo) {
    this.Algorithm = algo;
}


// Chunk hash generation loop
// first part of the tree hash algorithm
    byte[][] chunkHashes = new byte[numChunks][];

    byte[] buff = new byte[Mio];
    int bytesRead;
    int idx = 0;

    while ((bytesRead = await inputStream.ReadAsync(buff, 0, Mio)) > 0) {
        chunkHashes[idx++] = this.ComputeHash(buff, bytesRead);
    }


// Quick wrapper around the hash algorithm
// Also used by the second part of the tree hash computation
private byte[] ComputeHash(byte[] data, int count) => this.Algorithm.ComputeHash(data, 0, count);

I use the unprefixed versions of the hash algorithms (e.g. SHA512 rather than SHA512Managed) by default, but I can probably switch to the managed versions. And if needed, the method can become non-async.


Solution

  • The following should work. It makes use of MemoryPool<byte> to get an IMemoryOwner<byte> that we can use to retrieve our scratch buffer. We need a Memory<byte> to pass to the ReadAsync call, so we pass the Memory property of the IMemoryOwner<byte>, sliced to Mio since Rent only guarantees a buffer of at least that size.

    We then restructure the code to make use of the HashAlgorithm.TryComputeHash method which accepts a ReadOnlySpan<byte> as the source and a Span<byte> as the destination. We do allocate a new array (rather than using the ArrayPool) since you are keeping/storing the arrays.

    byte[][] chunkHashes = new byte[numChunks][]; 
    
    using var memory = MemoryPool<byte>.Shared.Rent(Mio);
    
    int bytesRead;
    int idx = 0; 
    
    // Rent(Mio) guarantees *at least* Mio bytes, so slice to keep each chunk exactly Mio long
    while ((bytesRead = await inputStream.ReadAsync(memory.Memory[..Mio], CancellationToken.None)) > 0) 
    { 
       var tempBuff = new byte[(int)Math.Ceiling(this.Algorithm.HashSize/8.0)];
       if (this.Algorithm.TryComputeHash(memory.Memory.Span[..bytesRead] /*1*/, tempBuff, out var hashWritten)) 
       {
          chunkHashes[idx++] = hashWritten == tempBuff.Length ? tempBuff : tempBuff[..hashWritten] /*2*/;
       } 
       else
          throw new InvalidOperationException("buffer not big enough");
    }
    

    For the source, we pass the Span property of the Memory<byte> buffer, which is again retrieved from the IMemoryOwner<byte>.Memory property, sliced to the appropriate length based on the number of bytes read. The Span<byte> that we pass as the destination must be at least as large as indicated by the algorithm's HashSize property, which is the number of bits (not bytes) of the hash. Since it's possible (though I believe unlikely) for an implementation to use a size that is not a multiple of 8, we take the ceiling of the division to round up if necessary. We don't need to call AsSpan on the destination array since there's an implicit conversion from T[] to Span<T>.

    I believe* that the number of bytes written will always equal the buffer size we computed from HashSize. If/when it does, we simply use the original array. Otherwise we need to slice it to the correct length based on the number of hash bytes actually written.

    In the event the buffer is not large enough, TryComputeHash returns false and we throw an exception. I'm fairly positive this won't happen to us, since we explicitly calculate the size from HashSize, but we handle the case anyway as a best practice.

    I've passed CancellationToken.None, but you can supply your own token. I'm also using Range syntax instead of explicitly calling Slice. If that's unavailable to you, or you simply don't like how it looks, you can be explicit about it:

    /*1*/ memory.Memory.Span.Slice(0, bytesRead)
    /*2*/ tempBuff.AsSpan(0, hashWritten).ToArray()
    

    Some possible assumptions we can make:

    • assume HashSize is always multiple of 8
    • assume HashSize is always equal to the number of bytes written and don't slice the final array
    • assume that we're always providing a large enough buffer (which, per the above, would be the exact size needed) and remove the if and the Exception. With all three assumptions, the loop simplifies to:
    while ((bytesRead = await inputStream.ReadAsync(memory.Memory[..Mio], CancellationToken.None)) > 0)
    {
       var tempBuff = new byte[this.Algorithm.HashSize/8];
       _ = this.Algorithm.TryComputeHash(memory.Memory.Span[..bytesRead], tempBuff, out _);
       chunkHashes[idx++] = tempBuff;
    } 
    

    * Unfortunately I cannot say with 100% certainty that these are valid assumptions. Most of the implementations whose source I've looked at have a Debug.Assert verifying that the buffer size and the number of bytes written are the same, so I do think they're reasonable. That said, I would personally stick with the more verbose option.

    You'll also note that I've removed your ComputeHash function. That's not to say you can't still use it, but I leave converting it to this Try-based Memory<> pattern as an exercise for the reader.
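
    If you do want to keep such a wrapper, here is a minimal sketch of one possible conversion (untested, assuming the same Algorithm field and the "exact size" assumptions above):

    // Span-based replacement for the old byte[]-based ComputeHash wrapper
    private byte[] ComputeHash(ReadOnlySpan<byte> data)
    {
       // HashSize is in bits; assumed here to be a multiple of 8
       var hash = new byte[this.Algorithm.HashSize / 8];

       if (!this.Algorithm.TryComputeHash(data, hash, out var written) || written != hash.Length)
          throw new InvalidOperationException("Unexpected hash length");

       return hash;
    }

    // which the read loop can then call as:
    // chunkHashes[idx++] = this.ComputeHash(memory.Memory.Span[..bytesRead]);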