let's say file A has the bytes:
2
5
8
0
33
90
1
3
200
201
23
12
55
and I have a simple hashing algorithm where I store the sum of every three consecutive bytes, so:
2
5
8 - = 8+5+2 = 15
0
33
90 - = 90+33+0 = 123
1
3
200 - = 204
201
23
12 - = 236
so I will be able to represent file A as 15, 123, 204, 236
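That block-sum scheme can be sketched in a few lines (Python purely for illustration; the function name is mine):

```python
# Sum each non-overlapping block of 3 consecutive bytes of file A.
file_a = [2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55]

def block_sums(data, block_size=3):
    """Sum every non-overlapping block of `block_size` bytes.
    Trailing bytes that don't fill a full block are ignored, as in the example."""
    return [sum(data[i:i + block_size])
            for i in range(0, len(data) - block_size + 1, block_size)]

print(block_sums(file_a))  # [15, 123, 204, 236]
```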
let's say I copy that file to a new computer, make some minor modifications, and the bytes of file B are:
255
2
5
8
0
33
90
1
3
200
201
23
12
255
255
"note the difference is an extra byte at the beginning of the file and 2 extra bytes at the end, but the rest is identical"
so I can perform the same algorithm to determine if some parts of the file are the same. Remember that file A was represented by the hash codes 15, 123, 204, 236
let's see if file B gives me some of those hash codes!
so on file B I will have to compute it for every 3 consecutive bytes (a sliding window)
int[] sums; // running totals: sums[i] is the sum of the first i+1 bytes
255 sums[0] = 255
2 sums[1] = 2+ sums[0] = 257
5 sums[2] = 5+ sums[1] = 262
8 sums[3] = 8+ sums[2] = 270 hash = sums[3]-sums[0] = 15 --> MATCHES FILE A!
0 sums[4] = 0+ sums[3] = 270 hash = sums[4]-sums[1] = 13
33 sums[5] = 33+ sums[4] = 303 hash = sums[5]-sums[2] = 41
90 sums[6] = 90+ sums[5] = 393 hash = sums[6]-sums[3] = 123 --> MATCHES FILE A!
1 sums[7] = 1+ sums[6] = 394 hash = sums[7]-sums[4] = 124
3 sums[8] = 3+ sums[7] = 397 hash = sums[8]-sums[5] = 94
200 sums[9] = 200+ sums[8] = 597 hash = sums[9]-sums[6] = 204 --> MATCHES FILE A!
201 sums[10] = 201+ sums[9] = 798 hash = sums[10]-sums[7] = 404
23 sums[11] = 23+ sums[10] = 821 hash = sums[11]-sums[8] = 424
12 sums[12] = 12+ sums[11] = 833 hash = sums[12]-sums[9] = 236 --> MATCHES FILE A!
55 sums[13] = 55+ sums[12] = 888 hash = sums[13]-sums[10] = 90
255 sums[14] = 255+ sums[13] = 1143 hash = sums[14]-sums[11] = 322
255 sums[15] = 255+ sums[14] = 1398 hash = sums[15]-sums[12] = 565
so from looking at that table I know that file B contains the bytes from file A plus additional ones, because the hash codes match (strictly speaking, equal sums only suggest equal bytes, since different byte runs can produce the same sum).
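The prefix-sum trick in the table above can be sketched as follows (Python for illustration; `sums` plays the same role as in the table):

```python
# Rolling hashes over file B: keep running prefix sums; the hash of the
# window ending at position i is sums[i+1] - sums[i+1 - window]. One pass, O(n).
file_b = [255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255]
target_hashes = {15, 123, 204, 236}  # file A's block sums

def rolling_hashes(data, window=3):
    sums = [0]      # sums[k] = total of the first k bytes
    hashes = []
    for b in data:
        sums.append(sums[-1] + b)
        if len(sums) > window:
            hashes.append(sums[-1] - sums[-1 - window])
    return hashes

matches = [h for h in rolling_hashes(file_b) if h in target_hashes]
print(matches)  # [15, 123, 204, 236]
```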
The reason I show this algorithm is that it is of order n; in other words, I was able to calculate the hash of the last 3 consecutive bytes without having to iterate through them!
If I were to use a more complex algorithm, such as taking the MD5 of the last 3 bytes, it would be of order n*m (m being the window size), because as I iterate through file B I would need an inner loop that recomputes the hash of the last m bytes from scratch.
How can I improve the algorithm while keeping it of order n, that is, computing each hash just once? If I use an existing hashing algorithm such as MD5, I will have to place an inner loop inside the algorithm, which significantly increases its order.
Note that it is possible to do the same thing with multiplication instead of addition, but the running product grows very fast. Maybe I can combine multiplication, addition, and subtraction...
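For what it's worth, one well-known way to combine multiplication and addition while keeping the numbers bounded is a polynomial rolling hash reduced modulo a prime (the Rabin-Karp approach). This is a sketch, not the question's algorithm; the base and modulus below are arbitrary choices:

```python
# Polynomial rolling hash: treat the window as digits in base B,
# reduced modulo a prime so the value stays bounded.
B = 257               # base, larger than any byte value
MOD = 1_000_000_007   # a large prime modulus

def poly_hashes(data, window=3):
    """Hash of window [a, b, c] is (a*B^2 + b*B + c) mod MOD, updated in O(1) per byte."""
    drop = pow(B, window - 1, MOD)   # weight of the byte leaving the window
    h, out = 0, []
    for i, byte in enumerate(data):
        h = (h * B + byte) % MOD     # shift the window left and add the new byte
        if i >= window - 1:
            out.append(h)
            h = (h - data[i - window + 1] * drop) % MOD  # remove the oldest byte
    return out
```

Unlike the plain sum, this hash is sensitive to byte order, so [2, 5, 8] and [8, 5, 2] hash differently.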
Also if I google for:
recursive hashing functions n-grams
a lot of information comes up and I think those algorithms are very difficult to understand...
I have to implement this algorithm for a project; that's why I am reinventing the wheel. I know there are a lot of algorithms out there.
Also, an alternative solution I was thinking of was to perform the same algorithm plus a stronger one: on file A I would perform the rolling sum over every 3 bytes plus an MD5 of each 3-byte window, and on the second file I would only compute the strong hash when the weak one matches...
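That weak-filter/strong-verify idea is essentially the strategy rsync uses. A sketch (Python for illustration; the window size and the choice of MD5 as the strong hash just follow the question's example):

```python
import hashlib

def strong_hash(window_bytes):
    # MD5 is only illustrative here; any strong hash would do.
    return hashlib.md5(bytes(window_bytes)).hexdigest()

def find_matches(file_a, file_b, window=3):
    # Index file A's windows by weak (sum) hash; keep the strong hashes per sum.
    weak_index = {}
    for i in range(len(file_a) - window + 1):
        w = file_a[i:i + window]
        weak_index.setdefault(sum(w), set()).add(strong_hash(w))
    # Slide over file B; only hash strongly when the cheap sum matches.
    matches = []
    total = sum(file_b[:window])
    for i in range(window - 1, len(file_b)):
        if i >= window:
            total += file_b[i] - file_b[i - window]   # O(1) rolling update
        if total in weak_index:
            w = file_b[i - window + 1:i + 1]
            if strong_hash(w) in weak_index[total]:
                matches.append(i - window + 1)        # start offset of match in B
    return matches
```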
The more I think about what you meant by "recursive", the more I doubt the solution I presented earlier is what you should implement to do anything useful.
You probably want to implement a hash tree algorithm, which is a recursive operation.
To do this you hash the list, divide the list in two, and recurse into those two sub-lists. Terminate either when your list reaches size 1 or a minimum desired size, as each level of recursion will double the size of your total hash output.
Pseudo-code:
create-hash-tree(input list, minimum size: default = 1):
    initialize the output list
    hash-sublist(input list, output list, minimum size)
    return output list

hash-sublist(input list, output list, minimum size):
    add sum-based-hash(list) result to output list // easily swap hash styles here
    if size(input list) > minimum size:
        split the list into two halves
        hash-sublist(first half of list, output list, minimum size)
        hash-sublist(second half of list, output list, minimum size)

sum-based-hash(list):
    initialize the running total to 0
    for each item in the list:
        add the current item to the running total
    return the running total
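Rendered as runnable Python (just a sketch; the names mirror the pseudo-code, and the leaf hash is the simple sum-based one):

```python
def sum_based_hash(items):
    # Easily swap hash styles here, as in the pseudo-code.
    return sum(items)

def hash_sublist(items, output, minimum_size):
    output.append(sum_based_hash(items))
    if len(items) > minimum_size:
        mid = len(items) // 2
        hash_sublist(items[:mid], output, minimum_size)
        hash_sublist(items[mid:], output, minimum_size)

def create_hash_tree(items, minimum_size=1):
    output = []
    hash_sublist(items, output, minimum_size)
    return output

print(create_hash_tree([2, 5, 8, 0]))  # [15, 7, 2, 5, 8, 8, 0]
```

The output lists the root hash first, then each subtree's hash in pre-order, so two trees sharing a prefix of their output share structure near the root.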
The running time of this whole algorithm is, I think, O(hash(m)) with m = n * (log(n) + 1), where hash(m) is usually linear time.
The storage space is something like O(hash * s) with s = 2n - 1, where the hash is usually constant size.
Note that for C#, I'd make the output list a List<HashType>, but I'd make the input list an IEnumerable<ItemType> to save on storage space, and use Linq to quickly "split" the list without allocating two new sub-lists.
I think you can get this to be O(n + m) execution time, where n is the size of the list and m is the size of the running tally, with m < n (otherwise all sums would be equal).
The memory consumption will be the stack size, plus size m for temporary storage.
To do this, use a double-ended queue (deque) and a running total. Push newly encountered values onto the back of the deque while adding to the running total, and when the deque grows past size m, pop off the front and subtract from the running total.
Here's some pseudo-code:
initialize the running total to 0
for each item in the list:
    add the current item to the running total
    push the current value onto the back of the deque
    if deque.length > m:
        pop off the front of the deque
        subtract the popped value from the running total
    assign the running total to the current sum slot in the list
reset the index to the beginning of the list
while the deque isn't empty:
    add the item in the list at the current index to the running total
    pop off the front of the deque
    subtract the popped value from the running total
    assign the running total to the current sum slot in the list
    increment the index
This is not recursive, it is iterative.
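The pseudo-code above translates almost line-for-line into Python using `collections.deque` (shown as an illustration; a C# version with a `Queue<int>` would be analogous):

```python
from collections import deque

def sliding_sums(items, m):
    """Write the sum of the last m items into each slot, with the first
    m slots wrapping around the end of the list, per the pseudo-code."""
    sums = [0] * len(items)
    window, total = deque(), 0
    # First pass: slide the window forward, filling every slot.
    for i, value in enumerate(items):
        total += value
        window.append(value)
        if len(window) > m:
            total -= window.popleft()
        sums[i] = total
    # Second pass: overwrite the first m slots with wrapped-around windows.
    i = 0
    while window:
        total += items[i]
        total -= window.popleft()
        sums[i] = total
        i += 1
    return sums
```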
A run of this algorithm looks like this (for m = 3):
value  sum slot  overwritten sum slot
2      2         69
5      7         62
8      15        15
0      13
33     41
90     123
1      124
3      94
200    204
201    404
23     424
12     236
55     90
You can remove the queue and the re-assignment of any slots by taking a sum of the last m values to start with, and using an offset of your index instead of popping off a deque, e.g. array[i - m].
This won't decrease your execution time, as you still need two loops: one to build up the running tally, and another to populate all the values. But it will decrease your memory usage to stack space only (effectively O(1)).
Here's some pseudo-code:
initialize the running total to 0
for the last m items in the list:
    add those items to the running total
for each item in the list:
    add the current item to the running total
    subtract the value of the item m slots earlier from the running total
    assign the running total to the current sum slot in the list
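A Python sketch of this queue-free variant (illustrative; note that Python's % operator already returns a non-negative result for a positive modulus, so the wrap-around comes for free here):

```python
def sliding_sums_offset(items, m):
    """Queue-free variant: seed the total with the last m items, then slide
    using an index offset instead of a deque. Produces the same slots as
    the deque-based version, wrapped windows included."""
    n = len(items)
    sums = [0] * n
    total = sum(items[-m:])              # seed: sum of the last m items
    for i in range(n):
        total += items[i]
        total -= items[(i - m) % n]      # the item m slots earlier, wrapping
        sums[i] = total
    return sums
```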
The "m slots earlier" part is the tricky bit. You can either split this up into two loops (one for the first m items, whose windows wrap around to the end of the list, and one for the rest), or you can use modulo arithmetic to "wrap" the index around when i - m < 0:
int valueToSubtract = array[((i - m) % n + n) % n]; // in C#, (i - m) % n is negative when i < m, so re-add n