I have an app that stores file-based data under an NTFS directory path keyed off the SHA-1 hash of the data. It has several really nice attributes (de-duplication, imperviousness to other metadata changes, etc.), but I'm curious about the best practices people have experienced for creating hash-based directory storage structures. My primary concern is the number of files/folders that can realistically be stored at a given folder depth.
Does anyone know what sorts of limitations I'll run into? If I were to dump them all into folders at the root of the storage path, I feel I would severely limit the storage's ability to grow. While it won't be a problem soon, I'd rather have a structure that avoids this than try to restructure massive storage later.
If I took the approach of chunking up the signature to create a deeper tree, is there any guidance on how much I would need to chunk it? Would something like this suffice?
StringBuilder foo = new StringBuilder(60);
// ...root, etc.
// a SHA-1 hash is 20 bytes, i.e. a 40-character hex string; chunk it up to distribute into smaller groups
// "\0000\0000000000000000\00000000000000000000"
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 0, 4);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 4, 16);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 20, 20);
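For context, here is the same 4/16/20 chunking as a self-contained helper; GetBucketPath is just an illustrative name, and it assumes the hash is already a 40-character hex string:

using System;
using System.IO;

static class HashPathHelper
{
    // Illustrative sketch: builds root\xxxx\xxxxxxxxxxxxxxxx\xxxxxxxxxxxxxxxxxxxx
    // from a 40-character hex SHA-1 string using the 4/16/20 split above.
    public static string GetBucketPath(string root, string sha1Hex)
    {
        if (sha1Hex == null || sha1Hex.Length != 40)
            throw new ArgumentException("Expected a 40-character hex SHA-1 string.", "sha1Hex");

        string path = Path.Combine(root, sha1Hex.Substring(0, 4));
        path = Path.Combine(path, sha1Hex.Substring(4, 16));
        return Path.Combine(path, sha1Hex.Substring(20, 20));
    }
}

The caller would then call Directory.CreateDirectory on the result before writing the file; CreateDirectory is a no-op if the directory already exists.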
Knowing that SHA-1 has a pretty decent distribution, I would have to assume that eventually there would be large clusters but that on average it would be evenly distributed. It is those clusters that I'm concerned about.
Are there performance penalties when accessing directory structures which are too wide? I know that Windows Explorer will choke, but what about programmatically accessing via C# / System.IO?
Thanks to the other answerers for their insight.
From other questions around the web, it sounds like NTFS itself can handle these sizes, but Windows Explorer and network operations will potentially choke at much lower thresholds. I ran a simulation of a very even random distribution, similar to what SHA-1 would produce, for a random set of 1,000,000 "files".
Windows Explorer definitely did not like a first-level chunk of 4 hex characters, since the directory count quickly approached the maximum for that level (16^4 = 65,536). I tweaked the top two directory chunk lengths to 3 hex characters each (16^3 = 4,096 max) and put the remaining 34 digits in the third level, to balance depth against the probability of too many directories per level. This seems to allow Windows Explorer to handle browsing the structure.
Here's my simulation:
// requires System.Diagnostics, System.IO, System.Security.Cryptography, System.Text
const string Root = @"C:\_Sha1Buckets";
using (TextWriter writer = File.CreateText(@"C:\_Sha1Buckets.txt"))
{
    // simulate a very even distribution like SHA-1 would produce
    RandomNumberGenerator rand = RandomNumberGenerator.Create();
    byte[] sha1 = new byte[20];

    Stopwatch watch = Stopwatch.StartNew();

    for (int i = 0; i < 1000000; i++)
    {
        // populate bytes with a fake SHA-1
        rand.GetBytes(sha1);

        // format bytes into hex string
        string hash = FormatBytes(sha1);

        // C:\_Sha1Buckets
        StringBuilder builder = new StringBuilder(Root, 60);

        // \012\345\6789abcdef0123456789abcdef01234567\
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 0, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 3, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 6, 34);
        builder.Append(Path.DirectorySeparatorChar);

        Directory.CreateDirectory(builder.ToString());

        if (i % 5000 == 0)
        {
            // write out timings every five thousand directories to see if the rate changes
            writer.WriteLine("{0}: {1}", i, watch.Elapsed);
            Console.WriteLine("{0}: {1}", i, watch.Elapsed);
            watch.Reset();
            watch.Start();
        }
    }

    watch.Reset();
    Console.WriteLine("Press any key to delete the directory structure...");
    Console.ReadLine();
    watch.Start();

    Directory.Delete(Root, true);

    writer.WriteLine("Delete took {0}", watch.Elapsed);
    Console.WriteLine("Delete took {0}", watch.Elapsed);
}
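FormatBytes isn't shown above; its exact shape doesn't affect the timings, but a minimal sketch that lowercase-hex-encodes the 20 bytes would be:

// Assumed implementation of the FormatBytes helper referenced above:
// hex-encodes a byte array into a lowercase string (40 characters for SHA-1).
static string FormatBytes(byte[] bytes)
{
    StringBuilder hex = new StringBuilder(bytes.Length * 2);
    foreach (byte b in bytes)
    {
        hex.Append(b.ToString("x2"));
    }
    return hex.ToString();
}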
After about fifty thousand directories, the simulation slows down a bit (15-20 seconds per 5,000) but stays at that rate. The delete at the end took over 30 minutes on my machine!
For 1 million evenly distributed hashes, the distribution works out to roughly 4,096 folders at the first level (16^3), an average of about 244 second-level folders under each of those (1,000,000 / 4,096), and about one leaf folder in each second-level folder.
That is very manageable within Windows Explorer and doesn't seem to get too deep or wide. Obviously, if the distribution weren't this even, we could run into problems, but only at the third level; the first two levels are bounded at 4,096 entries each. I suppose if the target set were larger, we could add an additional level and gain a lot of growth potential. For my application, 1 million is a very reasonable upper bound.
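As a sanity check on that reasoning, the expected branching for any split can be estimated without running the simulation. A rough sketch (PrintExpectedBranching is a hypothetical helper, not part of the code above; it ignores birthday collisions at the middle level, so it slightly overestimates distinct prefixes there):

using System;

class BranchingEstimate
{
    // Estimates, for an even hash distribution, how many folders appear at each
    // level of a given hex-character chunking scheme and how wide each parent gets.
    static void PrintExpectedBranching(long itemCount, params int[] chunkWidths)
    {
        double prefixChars = 0;
        double parentFolders = 1;
        for (int level = 0; level < chunkWidths.Length; level++)
        {
            prefixChars += chunkWidths[level];
            // distinct prefixes of this length, capped by the number of items
            double folders = Math.Min(Math.Pow(16, prefixChars), itemCount);
            Console.WriteLine("level {0} ({1} hex chars): ~{2:N0} folders, ~{3:N1} per parent",
                level + 1, chunkWidths[level], folders, folders / parentFolders);
            parentFolders = folders;
        }
    }

    static void Main()
    {
        // the 3 / 3 / 34 split used in the simulation above
        PrintExpectedBranching(1000000, 3, 3, 34);
    }
}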
Anyone have any thoughts on the validity of such a test for determining directory structure heuristics?