I have a binary file which can be seen as a concatenation of different sub-files:
INPUT FILE:
Hex Offset ID SortIndex
0000000 SubFile#1 3
0000AAA SubFile#2 1
0000BBB SubFile#3 2
...
FFFFFFF SubFile#N N
This is the information I have about each SubFile: its ID, its starting offset inside the input file, and the SortIndex it must have in the output file.
What's the fastest way, in your opinion, to produce a Sorted Output File?
For instance, the OUTPUT FILE will contain the SubFiles in the following order:
SubFile#2
SubFile#3
SubFile#1
...
I have thought about:
Please also consider the following information:
Both of my solutions are easy to implement, but they look really inefficient to me.
Thanks in advance
Even if the file is big, the number of IDs is not that huge.
You can just load all your IDs, SortIndexes, offsets and lengths into RAM, sort them in RAM with a simple quicksort and, when you finish, rewrite the entire file in the order given by your sorted array. I expect this to be faster than other methods. So... let's write some pseudo-code.
public struct FileItem : IComparable<FileItem>
{
    public string Id;
    public int SortIndex;
    public uint Offset;
    public uint Length;

    // Order by SortIndex only.
    public int CompareTo(FileItem other) { return this.SortIndex.CompareTo(other.SortIndex); }
}

public static FileItem[] LoadAndSortFileItems(FileStream inputFile)
{
    // ReadFileItems is a placeholder: fill the array with the (Id, SortIndex,
    // Offset, Length) of every sub-file, however your format exposes them.
    FileItem[] result = ReadFileItems(inputFile);
    Array.Sort(result); // O(n log n), ordered by SortIndex via CompareTo
    return result;
}

public static void WriteFileItems(FileItem[] items, FileStream inputFile, FileStream outputFile)
{
    byte[] buffer = new byte[1024 * 1024]; // block-copy buffer; tune the size to your machine
    foreach (FileItem item in items)
    {
        // Seek to the sub-file in the input and copy item.Length bytes to the output.
        inputFile.Seek(item.Offset, SeekOrigin.Begin);
        long remaining = item.Length;
        while (remaining > 0)
        {
            int read = inputFile.Read(buffer, 0, (int)Math.Min(remaining, buffer.Length));
            if (read <= 0) break; // unexpected end of input
            outputFile.Write(buffer, 0, read);
            remaining -= read;
        }
    }
}
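Putting it together, the call site would look roughly like this (the file names are just examples, and ReadFileItems above stands in for whatever actually loads your offset/index data):

using (FileStream input = new FileStream("input.bin", FileMode.Open, FileAccess.Read))
using (FileStream output = new FileStream("output.bin", FileMode.Create, FileAccess.Write))
{
    FileItem[] items = LoadAndSortFileItems(input);
    WriteFileItems(items, input, output);
}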
The number of read operations is linear, O(n), but seeking is required. The main performance problem with seeking is hard drive cache misses. Modern hard drives have a big cache, from 8 to 32 megabytes, and seeking through a big file in random order means cache misses, but I would not worry too much, because the time spent copying the sub-files is, I guess, greater than the time required by the seeks.
If you are using a solid state disk instead, seek time is effectively zero :)
Writing the output file, however, is O(n) and sequential, and this is a very good thing since it is totally cache friendly. You can get better times if you preallocate the size of the file before starting to write it.
FileStream myFileStream = new FileStream("output.bin", FileMode.Create, FileAccess.Write); // path is just an example
myFileStream.SetLength(predictedTotalSizeOfFile); // e.g. the sum of all the sub-file lengths
Sorting the FileItem structures in RAM is O(n log n), but even with 100,000 items it will be fast and will use only a small amount of memory.
The copy is the slowest part. Use 256 kilobytes to 2 megabytes per block copy, to ensure that copying big chunks of file A to file B will be fast; however, you can adjust the amount of block-copy memory by doing some tests, as sketched below, always keeping in mind that every machine is different.
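For example, a rough test along these lines (a sketch only; the file names are placeholders, and the first run will be skewed by the OS file cache) lets you compare candidate block sizes on your own hardware:

// Requires System.Diagnostics for Stopwatch.
int[] blockSizes = { 256 * 1024, 512 * 1024, 1024 * 1024, 2 * 1024 * 1024 };
foreach (int blockSize in blockSizes)
{
    byte[] buffer = new byte[blockSize];
    Stopwatch sw = Stopwatch.StartNew();
    using (FileStream src = new FileStream("input.bin", FileMode.Open, FileAccess.Read))
    using (FileStream dst = new FileStream("copytest.bin", FileMode.Create, FileAccess.Write))
    {
        int read;
        while ((read = src.Read(buffer, 0, buffer.Length)) > 0)
            dst.Write(buffer, 0, read);
    }
    sw.Stop();
    Console.WriteLine("{0} KB blocks: {1} ms", blockSize / 1024, sw.ElapsedMilliseconds);
}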
It is not useful to try a multithreaded approach; it will just slow down the copy.
It is obvious, but copying, for example, from drive C: to drive D: will be faster (of course, not two partitions but two different Serial ATA drives).
Consider also that you need to seek somewhere, either while reading or while writing; at some point, you will need to seek. Also, if you split the original file into several smaller files, you will make the OS seek among the smaller files, and this doesn't make sense: it will be messy, slower, and probably also more difficult to code. Consider also that if the files are fragmented the OS will seek by itself, and that is out of your control.