c#, java, .net, afp

Sorting a Binary File Based on an Index


I have a binary file which can be seen as a concatenation of different sub-files:

INPUT FILE:

Hex Offset     ID           SortIndex
0000000        SubFile#1    3
0000AAA        SubFile#2    1
0000BBB        SubFile#3    2
...
FFFFFFF        SubFile#N    N

This is the information I have about each SubFile:

  • Starting Offset
  • Length in bytes
  • Final sequence Order

What's the fastest way to produce a sorted output file, in your opinion?

For instance, the OUTPUT FILE will contain the SubFiles in the following order:

SubFile#2    
SubFile#3    
SubFile#1    
...

I have thought about:

  • Splitting the input file, extracting each SubFile to disk, then concatenating them in the correct order
  • Using FileSeek to move around the file and adding each SubFile to a BinaryWriter stream.

Consider the following information also:

  • The input file can be really huge (200 MB to 1 GB)
  • For those who know, I am speaking about IBM AFP files.

Both of my solutions are easy to implement, but they don't look very performant to me.

Thanks in advance


Solution

  • Even if the file is big, the number of IDs is not that huge.

    You can just load all your IDs, SortIndex, Offset and Length values into RAM, then sort them in RAM with a simple quicksort. When you finish, you rewrite the entire file in the order given by your sorted array. I expect this to be faster than the other methods. So... let's write some pseudocode.

    using System;
    using System.IO;

    public struct FileItem : IComparable<FileItem>
    {
        public string Id;
        public int SortIndex;
        public uint Offset;
        public uint Length;

        // Order SubFiles by their final sequence order.
        public int CompareTo(FileItem other) { return this.SortIndex.CompareTo(other.SortIndex); }
    }

    public static FileItem[] LoadAndSortFileItems(Stream inputFile)
    {
        FileItem[] result = null; // TODO: read the index and fill the array (format-specific, not shown)

        Array.Sort(result); // O(n log n), entirely in RAM
        return result;
    }

    public static void WriteFileItems(FileItem[] items, Stream inputFile, Stream outputFile, int blockSize = 256 * 1024)
    {
        byte[] buffer = new byte[blockSize];

        foreach (FileItem item in items)
        {
            // Copy inputFile[item.Offset .. item.Offset + item.Length] to outputFile in blocks.
            inputFile.Seek(item.Offset, SeekOrigin.Begin);
            long remaining = item.Length;
            while (remaining > 0)
            {
                int read = inputFile.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read <= 0) break;
                outputFile.Write(buffer, 0, read);
                remaining -= read;
            }
        }
    }
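
    As a rough usage sketch (the file paths here are illustrative, not part of the original answer), the two routines above could be wired together like this:

    using (FileStream input = new FileStream("input.afp", FileMode.Open, FileAccess.Read))
    using (FileStream output = new FileStream("output.afp", FileMode.Create, FileAccess.Write))
    {
        FileItem[] items = LoadAndSortFileItems(input);  // read the index, then sort by SortIndex
        WriteFileItems(items, input, output);            // copy the SubFiles in sorted order
    }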
    

    The number of read operations is linear, O(n), but seeking is required. The only performance problem with seeking is missing the hard drive cache. Modern hard drives have a big cache, from 8 to 32 megabytes; seeking through a big file in random order means cache misses, but I would not worry too much, because the time spent copying the files is, I guess, greater than the time required by the seeks.

    If you are using a solid state disk instead, seek time is practically zero :)

    Writing the output file, however, is O(n) and sequential, and this is a very good thing since you will be totally cache-friendly. You can ensure a better time if you preallocate the size of the file before starting to write it.

     FileStream myFileStream = ... // the output file stream
     myFileStream.SetLength(predictedTotalSizeOfFile); // preallocate before the first write
    

    Sorting the FileItem structures in RAM is O(n log n), but even with 100,000 items it will be fast and will use only a small amount of memory.

    The copy is the slowest part. Use 256 kilobytes to 2 megabytes per block copy, to ensure that copying big chunks of file A to file B will be fast; however, you can adjust the block-copy buffer size by doing some tests, always keeping in mind that every machine is different.
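
    If you want to measure that on your own machine, here is a minimal sketch (the block sizes and file paths are assumptions of this sketch; blockSize is the optional buffer-size parameter of WriteFileItems above):

    // 'items' is the sorted array produced by LoadAndSortFileItems.
    // Note: the OS file cache will make repeated runs look faster, so compare timings with care.
    int[] blockSizes = { 256 * 1024, 512 * 1024, 1024 * 1024, 2 * 1024 * 1024 };
    foreach (int blockSize in blockSizes)
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        using (FileStream input = new FileStream("input.afp", FileMode.Open, FileAccess.Read))
        using (FileStream output = new FileStream("output.afp", FileMode.Create, FileAccess.Write))
        {
            WriteFileItems(items, input, output, blockSize);
        }
        watch.Stop();
        Console.WriteLine("{0} KB blocks: {1} ms", blockSize / 1024, watch.ElapsedMilliseconds);
    }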

    It is not useful to try a multithreaded approach; it will just slow down the copy.

    It is obvious, but if you copy from drive C: to drive D:, for example, it will be faster (of course, not two partitions but two different Serial ATA drives).

    Consider also that you will need to seek at some point, either when reading or when writing. And if you split the original file into several smaller files, you will make the OS seek among the smaller files instead; this doesn't make sense, it will be messy, slower, and probably also more difficult to code. Consider also that if the files are fragmented, the OS will seek by itself, and that is out of your control.