Search code examples
c#algorithmfilecompare

Implement recursive hashing algorithm


let's say file A has the bytes:

2
5
8
0
33
90
1
3
200
201
23
12
55

and I have a simple hashing algorithm where I store the sum of the last three consecutive bytes so:

2   
5   
8   - = 8+5+2 = 15
0   
33  
90  - = 90+33+0 = 123
1   
3   
200 - = 204
201 
23  
12  - = 236

so I will be able to represent file A as 15, 123, 204, 236

let's say I copy that file to a new computer B and I made some minor modifications and the bytes of file B are:

255 
2   
5   
8   
0   
33  
90  
1   
3   
200 
201 
23  
12
255
255 

"note the difference is an extra byte at the beginning of the file and 2 extra bytes at the end but the rest is very similar"

so I can perform the same algorithm to determine if some parts of the file are the same. Remember that file A was represented by the hash codes 15, 123, 204, 236 let's see if file B gives me some of those hash codes!

so on file B I will have to do it every 3 consecutive bytes

int[] sums; // array where we will hold the sum of the last bytes


255 sums[0]  =          255     
2   sums[1]  =  2+ sums[0]    = 257     
5   sums[2]  =  5+ sums[1]    = 262     
8   sums[3]  =  8+ sums[2]    = 270  hash = sums[3]-sums[0]   = 15   --> MATHES FILE A!
0   sums[4]  =  0+ sums[3]    = 270  hash = sums[4]-sums[1]   = 13
33  sums[5]  =  33+ sums[4]   = 303  hash = sums[5]-sums[2]   = 41
90  sums[6]  =  90+ sums[5]   = 393  hash = sums[6]-sums[3]   = 123  --> MATHES FILE A!
1   sums[7]  =  1+ sums[6]    = 394  hash = sums[7]-sums[4]   = 124
3   sums[8]  =  3+ sums[7]    = 397  hash = sums[8]-sums[5]   = 94
200 sums[9]  =  200+ sums[8]  = 597  hash = sums[9]-sums[6]   = 204  --> MATHES FILE A!
201 sums[10] =  201+ sums[9]  = 798  hash = sums[10]-sums[7]  = 404
23  sums[11] =  23+ sums[10]  = 821  hash = sums[11]-sums[8]  = 424
12  sums[12] =  12+ sums[11]  = 833  hash = sums[12]-sums[9]  = 236  --> MATHES FILE A!
55  sums[13] =  55+ sums[12]  = 888  hash = sums[13]-sums[10] = 90
255 sums[14] =  255+ sums[13] = 1143    hash = sums[14]-sums[11] =  322
255 sums[15] =  255+ sums[14] = 1398    hash = sums[15]-sums[12] =  565

so from looking at that table I know that file B contains the bytes from file A plus additional ones because the hash codes match.

the reason why I show this algorithm is because it was of order n In other words I was able to calculate the hash of the last 3 consecutive bytes without having to iterate through them!

If I where to have a more complex algorithm such as doing md5 of the last 3 bytes then it will be of order n^3 that is because as I iterate through file B I will have to have an inner for loop that will compute the hash of the last three bytes.

So my question is:

how can I improve the algorithm keeping it of order n. That is computing the hash just once. If I use an existing hashing algorithm such as md5 the I will have to place an inner loop inside the algorithm that will significantly increase the order of the algorithm.

Note that it is possible to do the same thing with multiplication instead of addition. but the counter significantly grows really fast. Maybe I can combine multiplication and addition and subtraction...

Edit

Also if I google for:

recursive hashing functions in-grams

a lot of information comes up and I think those algorithms are very difficult to understand...

I have to implement this algorithm for a project that's why I am reinventing the wheel... I know there are a lot of algorithms out there.

Also an alternative solution that I was thinking was to perform the same algorithm plus another one that is strong. so on file A I will perform the same algorithm every 3 bytes plus md5 of each 3 bytes. On the second file I will just perform the second algorithm if the first one comes true....


Solution

  • Edit:

    The more I think about what you meant by "recursive", the more I doubt the solution I presented earlier is what you should implement to do anything useful.

    You probably want to implement a hash tree algorithm, which is a recursive operation.

    To do this you hash the list, divide the list in two, and recurse into those two sub-lists. Terminate either when your list is size 1, or a minimum desired hash size, as each level of recursion will double the size of your total hash output.

    Pseudo-code:

    create-hash-tree(input list, minimum size: default = 1):
      initialize the output list
      hash-sublist(input list, output list, minimum size)
      return output list
    
    hash-sublist(input list, output list, minimum size):
      add sum-based-hash(list) result to output list // easily swap hash styles here
      if size(input list) > minimum size:
        split the list into two halves
        hash-sublist(first half of list, output list, minimum size)
        hash-sublist(second half of list, output list, minimum size)
    
    sum-based-hash(list):
      initialize the running total to 0
    
      for each item in the list:
        add the current item to the running total
    
      return the running total
    

    The running time of this whole algorithm I think is O(hash(m)); m = n * (log(n) + 1), with hash(m) usually being linear time.

    The storage space is something like O(hash * s); s = 2n - 1, with the hash usually being constant size.

    Note that for C#, I'd make the output list a List<HashType>, but I'd make the input list an IEnumerable<ItemType> to save on storage space, and use Linq to quickly "split" the list without allocating two new sub-lists.

    Original:

    I think you can get this to be O(n + m) execution time; where n is the size of the list, m is the size of the running tally, and n < m (otherwise all sums would be equal).

    With Double-Ended Queue

    The memory consumption will be the stack size, plus size m for temporary storage.

    To do this, use a double-ended queue, and a running total. Push newly encountered values onto the list while adding to the running total, and when the queue reaches size m, pop off the list and subtract from the running total.

    Here's some pseudo-code:

    initialize the running total to 0
    
    for each item in the list:
      add the current item to the running total
      push the current value onto the end of the dequeue
      if dequeue.length > m:
        pop off the front of the dequeue
        subtract the popped value from the running total
      assign the running total to the current sum slot in the list
    
    reset the index to the beginning of the list
    
    while the dequeue isn't empty:
      add the item in the list at the current index to the running total
      pop off the front of the dequeue
      subtract the popped value from the running total
      assign the running total to the current sum slot in the list
      increment the index
    

    This is not recursive, it is iterative.

    A run of this algorithm looks like this (for m = 3):

    value   sum slot   overwritten sum slot
    2       2          92
    5       7          74
    8       15         70
    0       15         15
    33      46
    90      131
    1       124
    3       127
    200     294
    201     405
    23      427
    12      436
    55      291
    

    With indexing

    You can remove the queue and the re-assignment of any slots by taking a sum of the last m values to start with, and using an offset of your index instead of popping off a dequeue, e.g. array[i - m].

    This won't decrease your execution time as you still have to have two loops, one to build up the running tally, and another to populate all the values. But it will decrease your memory usage to stack-space only (effectively O(1)).

    Here's some pseudo-code:

    initialize the running total to 0
    
    for the last m items in the list:
      add those items to the running total
    
    for each item in the list:
      add the current item to the running total
      subtract the value of the item m slots earlier from the running total
      assign the running total to the current sum slot in the list
    

    The m slots earlier is the tricky part. You can either split this up into two loops:

    • One that indexes from the end of the list, minus m, plus i
    • One that indexes from i minus m

    Or you can use modulo arithmetic to "wrap" the value around when i - m < 0:

    int valueToSutract = array[(i - m) % n];